Reading and writing CSV files with Pandas

Standard

CSV (Comma Separated Values) files are a very simple and common format for data sharing. CSV files are simple (albeit sometimes large) text files that contain tables. Each line is a row, and within each row, each value is assigned a column by a separator.

Separators are characters that cannot be found in any of the values; for instance, a comma (the most common separator, hence the name Comma Separated Values) in a file where values are numbers or text without commas; whitespaces in files where there are no texts with whitespaces; or semicolons (the default type of CSV file produced by Excel). CSV files have the advantage that they are easy to process, and can be even read directly with a text editor. The disadvantage is that they are not as efficient in size and speed as binary files.

Python has methods for dealing with CSV files, but in this entry, I will only concentrate on Pandas. When I started, I made the mistake to open files with the standard Python methods, then parse the files and create the DataFrame. This complicates everything unnecesarily, since Pandas covers this use case by default.

There are two functions to deal with CSV files: pandas.read_csv() to load the contents of a CSV file into a DataFrame, and DataFrame.to_csv() to save the contents of a DataFrame in a CSV.

Let’s suppose we have a trig.csv file with the following contents:

X,Sine,Cosine,Tangent
-3.14,0.0,-1.0,0.0
-2.36,-0.7,-0.71,0.99
-1.57,-1.0,0.0,-1255.77
-0.79,-0.71,0.7,-1.01
0.0,0.0,1.0,0.0
0.79,0.71,0.7,1.01
1.57,1.0,0.0,1255.77
2.36,0.7,-0.71,-0.99
3.14,0.0,-1.0,0.0

To load it, we must use pandas.read_csv():

>>> import pandas as pd
>>> trigonometric = pd.read_csv('trig.csv', index_col='X')
>>> print(trigonometric)

       Sine  Cosine  Tangent
X                           
-3.14  0.00   -1.00     0.00
-2.36 -0.70   -0.71     0.99
-1.57 -1.00    0.00 -1255.77
-0.79 -0.71    0.70    -1.01
 0.00  0.00    1.00     0.00
 0.79  0.71    0.70     1.01
 1.57  1.00    0.00  1255.77
 2.36  0.70   -0.71    -0.99
 3.14  0.00   -1.00     0.00

We have explicitly indicated that the first column (titled X) is the index. Pandas by default uses the first row as the column titles, and creates an integer index. For any different behavior, the arguments of pandas.read_csv() must be properly set. See the documentation for more details.

Once the data has been loaded, it can be manipulated, used for calculations, etc …

>>> print(trigonometric.mean())
Sine      -2.467162e-17
Cosine    -1.133333e-01
Tangent    1.011537e-15
dtype: float64
>>> trigonometric.loc[:,'Sine2'] = trigonometric.loc[:,'Sine'].apply(lambda x: np.power(x,2))
>>> trigonometric.loc[:,'Cosine2'] = trigonometric.loc[:,'Cosine'].apply(lambda x: np.power(x,2))
>>> trigonometric.loc[:,'Tangent2'] = trigonometric.loc[:,'Tangent'].apply(lambda x: np.power(x,2))
>>> print(trigonometric)
       Sine  Cosine  Tangent   Sine2  Cosine2      Tangent2
X                                                          
-3.14  0.00   -1.00     0.00  0.0000   1.0000        0.0000
-2.36 -0.70   -0.71     0.99  0.4900   0.5041        0.9801
-1.57 -1.00    0.00 -1255.77  1.0000   0.0000  1576958.2929
-0.79 -0.71    0.70    -1.01  0.5041   0.4900        1.0201
 0.00  0.00    1.00     0.00  0.0000   1.0000        0.0000
 0.79  0.71    0.70     1.01  0.5041   0.4900        1.0201
 1.57  1.00    0.00  1255.77  1.0000   0.0000  1576958.2929
 2.36  0.70   -0.71    -0.99  0.4900   0.5041        0.9801
 3.14  0.00   -1.00     0.00  0.0000   1.0000        0.0000

In this example, we have calculated the average of each column, and then added three new columns that are the squares of the original columns. To do this, we have used the DataFrame.apply() method to apply a lambda function containing a call to numpy.power(). Finally, we can save the results to a new CSV file:

>>> trigonometric.to_csv('trig2.csv')

pandas.to_csv() has many other options, that can be seen in the documentation. Pandas will by default save the index as the first column with a label if it is set (otherwise, it can be added manually), and the first row will contain the column titles.

Pandas has support for other file types (XLS, pickle, etc…), but CSV is the most used type in data science, due to its ease of use and the wide support by many other platforms and applications.

Leave a Reply

Your email address will not be published. Required fields are marked *