Correlating time series with Pandas


In this entry, we will look at a practical application of the Pandas library. We will load a CSV file containing measurements taken on a flotation cell into a DataFrame.

A flotation cell is used by mining companies to separate metals from the extracted mixtures by exploiting differences in their hydrophobicity.
The dataset (download link) has a timestamp index, and each column is a time-dependent variable, such as the feed rate or the air flow rate.
We will compute the correlation between every pair of variables. This gives us a hint of how strongly related they are and may suggest whether common factors affect them.

First, we load the data from the CSV file:

>>> import pandas as pd
>>> fcdata = pd.read_csv('flotation-cell.csv', index_col=0)
>>> print(fcdata.head())
                      Feed rate  Upstream pH  CuSO4 added  Pulp level  \
Date and time                                                           
15/12/2004 19:57:01  341.049347    10.820513     7.995605   24.443470   
15/12/2004 19:57:31  274.270782    10.827351     7.786569   27.819294   
15/12/2004 19:58:01  334.836761    10.854701     7.655922   30.335533   
15/12/2004 19:58:32  323.605927    10.885470     7.838828   30.663738   
15/12/2004 19:59:03  322.341309    10.851282     7.995605   30.288647   

                     Air flow rate  
Date and time                       
15/12/2004 19:57:01       2.802198  
15/12/2004 19:57:31       2.798535  
15/12/2004 19:58:01       2.805861  
15/12/2004 19:58:32       2.802198  
15/12/2004 19:59:03       2.805861

We loaded the file while explicitly telling read_csv() that the first column is the index (index_col=0), as described in the documentation. DataFrame.head() shows the first five rows of the DataFrame, which is very useful for getting a feel for a new dataset. We can now plot the data:

>>> import matplotlib.pyplot as plt
>>> fcdata.plot()
>>> plt.show()  # needed to display the figure outside interactive mode
[Figure: dataset plot]


The graphic is not beautiful, but it is informative. To get better plots, we can use raw Matplotlib or better yet Seaborn.
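
As an illustration of the Matplotlib route, here is a minimal sketch that draws each variable in its own panel. This helps because the variables live on very different scales (the feed rate is in the hundreds while the air flow rate stays below 3). It is not the plot shown above: the DataFrame is a stand-in built from the five rows printed by head(), the output file name is hypothetical, and the Agg backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real dataset: only the five rows printed by head().
# With the real file, fcdata would come from pd.read_csv() as before.
fcdata = pd.DataFrame(
    {
        "Feed rate": [341.049347, 274.270782, 334.836761, 323.605927, 322.341309],
        "Upstream pH": [10.820513, 10.827351, 10.854701, 10.885470, 10.851282],
        "CuSO4 added": [7.995605, 7.786569, 7.655922, 7.838828, 7.995605],
        "Pulp level": [24.443470, 27.819294, 30.335533, 30.663738, 30.288647],
        "Air flow rate": [2.802198, 2.798535, 2.805861, 2.802198, 2.805861],
    },
    index=[
        "15/12/2004 19:57:01",
        "15/12/2004 19:57:31",
        "15/12/2004 19:58:01",
        "15/12/2004 19:58:32",
        "15/12/2004 19:59:03",
    ],
)

# The index is plain text after loading; convert the day-first timestamps
# into a DatetimeIndex so that the x axis is a real time axis.
fcdata.index = pd.to_datetime(fcdata.index, format="%d/%m/%Y %H:%M:%S")

# One panel per variable, all sharing the same time axis.
axes = fcdata.plot(subplots=True, sharex=True, figsize=(8, 10), legend=False)
for ax, name in zip(axes, fcdata.columns):
    ax.set_ylabel(name)

plt.tight_layout()
plt.savefig("flotation-cell-panels.png")  # hypothetical output file name
```

In an interactive session, plt.show() can replace the savefig() call.
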
Next, we will find the correlation among the variables. Fortunately, Pandas has a built-in method for this: DataFrame.corr().

>>> fcdata.corr()
               Feed rate  Upstream pH  CuSO4 added  Pulp level  Air flow rate
Feed rate       1.000000     0.163943     0.438234    0.036742       0.026665
Upstream pH     0.163943     1.000000     0.188979    0.088289       0.040845
CuSO4 added     0.438234     0.188979     1.000000   -0.013927       0.008808
Pulp level      0.036742     0.088289    -0.013927    1.000000       0.158203
Air flow rate   0.026665     0.040845     0.008808    0.158203       1.000000

This table shows that the most correlated variables are Feed rate and CuSO4 added, with a coefficient of about 0.44. This is also visible in the plot: when the average feed rate drops, the CuSO4 added drops as well. Here we computed the Pearson coefficient, which is the default, but the documentation shows that other coefficients can be selected through the method parameter.
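
To make the meaning of the coefficient more concrete, the following sketch computes the Pearson coefficient by hand for one pair of columns and compares it with the value returned by DataFrame.corr(). It also requests the Spearman coefficient through the method parameter and picks the most strongly correlated pair programmatically. The helper pearson() is a hypothetical name introduced for illustration, and the DataFrame is a stand-in built from the five rows printed by head(), so the numbers will not match the table above.

```python
import numpy as np
import pandas as pd

# Stand-in for the real dataset: only the five rows printed by head().
fcdata = pd.DataFrame(
    {
        "Feed rate": [341.049347, 274.270782, 334.836761, 323.605927, 322.341309],
        "Upstream pH": [10.820513, 10.827351, 10.854701, 10.885470, 10.851282],
        "CuSO4 added": [7.995605, 7.786569, 7.655922, 7.838828, 7.995605],
        "Pulp level": [24.443470, 27.819294, 30.335533, 30.663738, 30.288647],
        "Air flow rate": [2.802198, 2.798535, 2.805861, 2.802198, 2.805861],
    }
)


def pearson(x, y):
    """Pearson coefficient: covariance divided by the product of the
    standard deviations (hypothetical helper, for illustration only)."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# The hand-made value and the one from Pandas should agree.
by_hand = pearson(fcdata["Feed rate"], fcdata["CuSO4 added"])
from_pandas = fcdata.corr().loc["Feed rate", "CuSO4 added"]
print(by_hand, from_pandas)

# Rank-based alternative selected through the method parameter.
spearman = fcdata.corr(method="spearman")
print(spearman)

# Most strongly correlated pair of distinct variables: keep only the part
# of the matrix above the diagonal, flatten it, and take the largest
# absolute value.
corr = fcdata.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
strongest = pairs.abs().idxmax()
print(strongest, pairs[strongest])
```

The absolute value is used when ranking the pairs because a strong negative correlation is as informative as a strong positive one.
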
In this entry, we have applied what we covered in previous entries about Pandas to a very simple case. In the next entry, we will see yet another interesting application.
