Parsing comma-separtated-values (CSV) files is a common task. There are many tools available in Python to deal with this. Let's start by using the built-in csv
module.
import csv # using Python module
Now, we want to open an example file, and read the contents:
f = open('example_1.csv','rb') # use binary mode if on MS windows
d = [i for i in csv.reader(f) ] # use list comprehension to read from file
f.close() # close file
print d
[['Name', 'DOB', 'Years', 'Degree'], ['Alice Jones', '12/1/1980', '3', 'MS'], ['Bob Smith', '1/1/1969', '4', 'BS'], ['John Book', '5/3/1980', '11', 'BA'], ['Billy Blanks', '6/9/2000', '8', 'AA']]
This reads the rows of the CSV file into a list-of-lists.
Now, let's suppose we want to create columns for last name
and first name
instead of having just one name
field. The first element in the list d
is the header, so we had the additional fields there:
d[0].append('Last Name')
d[0].append('First Name')
print d[0]
['Name', 'DOB', 'Years', 'Degree', 'Last Name', 'First Name']
Now, we want to split the original Name
field into first and last names and put these at the ends of their respective rows.
for row in d[1:]: # start at 1st, not 0th column. Each row is a list
first,last= row[0].split() # split on white-space
row.append(last) # append to each row
row.append(first)
print d
[['Name', 'DOB', 'Years', 'Degree', 'Last Name', 'First Name'], ['Alice Jones', '12/1/1980', '3', 'MS', 'Jones', 'Alice'], ['Bob Smith', '1/1/1969', '4', 'BS', 'Smith', 'Bob'], ['John Book', '5/3/1980', '11', 'BA', 'Book', 'John'], ['Billy Blanks', '6/9/2000', '8', 'AA', 'Blanks', 'Billy']]
#%qtconsole
Now, we want to write out our new data in CSV format
f = open('example_1_out.csv','wb') # write mode binary
fw = csv.writer(f) # create csv writer
fw.writerows(d)
f.close() # close file
Now, opening the file example_1_out.csv
using excel (or another reader) should show the new column. That covers the most direct and pure-Python way to dealing with CSV files. However, there are many other tools available. For example, numpy
provides power methods to access these files.
Let's see how to accomplish the same work as above by using Numpy.
import numpy as np
d = np.loadtxt('example_1.csv',delimiter=',',dtype=str)
print d
print d.dtype
[['Name' 'DOB' 'Years' 'Degree'] ['Alice Jones' '12/1/1980' '3' 'MS'] ['Bob Smith' '1/1/1969' '4' 'BS'] ['John Book' '5/3/1980' '11' 'BA'] ['Billy Blanks' '6/9/2000' '8' 'AA']] |S12
Notice that we did not have to use open
to get at the contents of the file. By default, the delimiter
is any whitespace, so we had to change this to the comma character. The dtype
specifies we want everything to be read in as a string. Numpy can figure out how long that string needs to be as it goes through the file so we don't have to specify that ahead of time (we probably don't know it anyway). In this case, the dtype
turns out to be a twelve character string S12
. Note that this is the maximum length is used for all strings so there is obviously a lot of extra space if most of the strings are short and just a few are long.
Numpy provides many more ways of reading data via the dtype
. For example,
dt = [('name', 'S64'), # The first element of the tuple is our name for each respective column
('dob', 'S64'), # and the second element is the numpy dtype we want for that column.
('years', 'int'), # Here we want years as an integer, not a string
('degree', 'S64'), ]
d = np.loadtxt('example_1.csv',delimiter=',',dtype=dt,skiprows=1) # skip the header row
print d
[('Alice Jones', '12/1/1980', 3, 'MS') ('Bob Smith', '1/1/1969', 4, 'BS') ('John Book', '5/3/1980', 11, 'BA') ('Billy Blanks', '6/9/2000', 8, 'AA')]
The advantage of doing it this way is that now we can compute the years
column using numpy
tools. For example, here is the mean of the years.
print d['years'].mean() # using numpy arrays
6.5
Now to get back to the main task at hand: splitting the name field into first and last name.
import string
n=map(string.split,d['name'])
w=array([tuple(i)+tuple(j) for i,j in zip(d,n)], # list comprehension glues tuple-ized rows together
dtype=dt+[('first','S64'),('last','S64')]) # append new dtypes to existing list of dtypes
That was kind of non-simple, but now we can write this to a CSV using savetxt
.
# the comments are set to '' to avoid hash marks on the first line.
np.savetxt('np_output.csv',w,delimiter=',',fmt='%s',header='name,dob,years,degree,first,last',comments='')
Now, you can inspect the so-generated file and verify it is a CSV.
pandas
is the real power tool for this job.
import pandas as pd
d = pd.read_csv('example_1.csv')
print d
print type(d)
Name DOB Years Degree 0 Alice Jones 12/1/1980 3 MS 1 Bob Smith 1/1/1969 4 BS 2 John Book 5/3/1980 11 BA 3 Billy Blanks 6/9/2000 8 AA <class 'pandas.core.frame.DataFrame'>
Now, we have read the CSV file as a pandas
DataFrame which is a super-structure that sits on top of numpy
. Let's examine the columns of this DataFrame.
print d.columns
Index([Name, DOB, Years, Degree], dtype=object)
Notice that there is an extra space after the Name
. This potentially makes it hard to access the columns using pandas slicing. For example,
print d.DOB # this works great when the column header name has no spaces in it.
print d['Name'] # you can also refer to columns using this syntax
0 12/1/1980 1 1/1/1969 2 5/3/1980 3 6/9/2000 Name: DOB, dtype: object 0 Alice Jones 1 Bob Smith 2 John Book 3 Billy Blanks Name: Name, dtype: object
Luckily, this is not hard to fix. We just need to create another column that is free of these trailing spaces:
d['name']=d['Name '] # easily create extra column
print d.name # now you can access this column using this syntax
Pandas is a lot more powerful than this! We can parse the columns by types individually by providing a dtype
for each column as a dictionary.
d = pd.read_csv('example_1.csv',dtype={'Name ':'S64','DOB':'S64','Years':int,'Degree':'S64'})
print d
Now, we can compute along the columns as we did before with numpy
.
print d.Years.mean()
You can also parse the DOB
field to get a true timestamp instead of a string using the parse_dates
keyword.
d = pd.read_csv('example_1.csv',dtype={'Name':'S64',
'DOB':'S64',
'Years':int,
'Degree':'S64'},parse_dates=[1])
Now, we can compute with these datetime
objects as in the following.
# difference in birthdays between Alice Jones and John Book
print d.DOB[0] - d.DOB[2]
212 days, 0:00:00
Now we now how many days are between the respective birthdays of Alice Jones and John Book.
%qtconsole
d['first']=map(lambda x:string.split(x)[0],d['Name'])
d['last']=map(lambda x:string.split(x)[1],d['Name'])
print d
Name DOB Years Degree first last 0 Alice Jones 1980-12-01 00:00:00 3 MS Alice Jones 1 Bob Smith 1969-01-01 00:00:00 4 BS Bob Smith 2 John Book 1980-05-03 00:00:00 11 BA John Book 3 Billy Blanks 2000-06-09 00:00:00 8 AA Billy Blanks
print d
Name DOB Years Degree first last 0 Alice Jones 1980-12-01 00:00:00 3 MS Alice Jones 1 Bob Smith 1969-01-01 00:00:00 4 BS Bob Smith 2 John Book 1980-05-03 00:00:00 11 BA John Book 3 Billy Blanks 2000-06-09 00:00:00 8 AA Billy Blanks
import pandas.io.sql as pd_sql
import sqlite3 as sql # sqlite3 is built into Python
con = sql.connect("example_1.db")
pd_sql.write_frame(d,'data',con) # write to DB as table named "data"
con.close()