In this entry, we will see a practical application of the Pandas library. We will load a CSV file containing measurements taken on a flotation cell into a DataFrame.
A flotation cell is used by mining companies to separate metals from the extracted mixtures by exploiting differences in their hydrophobicity.
The dataset (download link) has a timestamp index, and each column is a time-dependent variable, measuring magnitudes such as the feed rate or air flow rate.
We will find the correlation among all the variables. This gives us a hint of how strongly related they are, and ultimately suggests whether common factors are affecting them.
First, we load the data from the CSV file:
>>> import pandas as pd
>>> fcdata = pd.read_csv('flotation-cell.csv', index_col=0)
>>> print(fcdata.head())
                      Feed rate  Upstream pH  CuSO4 added  Pulp level  \
Date and time
15/12/2004 19:57:01  341.049347    10.820513     7.995605   24.443470
15/12/2004 19:57:31  274.270782    10.827351     7.786569   27.819294
15/12/2004 19:58:01  334.836761    10.854701     7.655922   30.335533
15/12/2004 19:58:32  323.605927    10.885470     7.838828   30.663738
15/12/2004 19:59:03  322.341309    10.851282     7.995605   30.288647

                     Air flow rate
Date and time
15/12/2004 19:57:01       2.802198
15/12/2004 19:57:31       2.798535
15/12/2004 19:58:01       2.805861
15/12/2004 19:58:32       2.802198
15/12/2004 19:59:03       2.805861
We loaded the file with index_col=0, which explicitly tells read_csv() that the first column is the index, as indicated in the documentation. DataFrame.head() shows the first five rows of the DataFrame, and it is a very useful method for getting a feel for a new dataset. We can now draw a plot of the data:
>>> import matplotlib.pyplot as plt
>>> fcdata.plot()
>>> plt.show()
The graphic is not beautiful, but it is informative. To get better plots, we can use raw Matplotlib or better yet Seaborn.
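As a rough sketch of the raw Matplotlib route, one option is to give each variable its own panel, since the columns live on very different scales. The snippet below makes a few assumptions: the index is parsed as day-first dates, the file is comma-separated, and the first five rows printed earlier are inlined as text so the example runs without the CSV file. The helper names load_sample() and plot_panels() are made up for this illustration.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# First rows of the dataset as printed earlier, inlined so the sketch runs
# without the CSV file. The comma-separated layout is an assumption.
SAMPLE = """Date and time,Feed rate,Upstream pH,CuSO4 added,Pulp level,Air flow rate
15/12/2004 19:57:01,341.049347,10.820513,7.995605,24.443470,2.802198
15/12/2004 19:57:31,274.270782,10.827351,7.786569,27.819294,2.798535
15/12/2004 19:58:01,334.836761,10.854701,7.655922,30.335533,2.805861
15/12/2004 19:58:32,323.605927,10.885470,7.838828,30.663738,2.802198
15/12/2004 19:59:03,322.341309,10.851282,7.995605,30.288647,2.805861
"""


def load_sample():
    """Load the inlined rows and turn the index into real timestamps."""
    df = pd.read_csv(io.StringIO(SAMPLE), index_col=0)
    # Assumed format: day/month/year followed by hours:minutes:seconds
    df.index = pd.to_datetime(df.index, format="%d/%m/%Y %H:%M:%S")
    return df


def plot_panels(df):
    """Draw one panel per column, all sharing the same time axis."""
    axes = df.plot(subplots=True, sharex=True, figsize=(8, 10), legend=False)
    for ax, name in zip(axes, df.columns):
        ax.set_ylabel(name)
    return axes


fcsample = load_sample()
axes = plot_panels(fcsample)
plt.tight_layout()
```

In an interactive session, calling plt.show() afterwards displays the figure. With the real file, the same plot_panels() helper could be applied to fcdata once its index has been converted with pd.to_datetime().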
Next, we will find the correlation among the parameters. Fortunately, Pandas has a built-in method for this: DataFrame.corr().
>>> fcdata.corr()
               Feed rate  Upstream pH  CuSO4 added  Pulp level  Air flow rate
Feed rate       1.000000     0.163943     0.438234    0.036742       0.026665
Upstream pH     0.163943     1.000000     0.188979    0.088289       0.040845
CuSO4 added     0.438234     0.188979     1.000000   -0.013927       0.008808
Pulp level      0.036742     0.088289    -0.013927    1.000000       0.158203
Air flow rate   0.026665     0.040845     0.008808    0.158203       1.000000
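This table shows that the most correlated variables are Feed rate and CuSO4 added, with a coefficient of about 0.44. This is visible in the plot: when there is a drop in the average feed rate, a CuSO4 drop is also observed. All other pairs have coefficients below 0.2, so they are only weakly related. In this case, we extracted the Pearson coefficient, which is the default, but the documentation shows that other coefficients, such as Kendall and Spearman, can be extracted as well.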
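Reading the strongest pair off the table by eye works for five variables, but it can also be done in code. The following is a minimal sketch, not part of the original analysis: strongest_pair() is a hypothetical helper, the Pearson matrix is copied from the output above so the example runs without the CSV file, and the small x/y frame is toy data used only to show how the method argument of DataFrame.corr() changes the result.

```python
import numpy as np
import pandas as pd

# Pearson matrix copied from the fcdata.corr() output above.
names = ["Feed rate", "Upstream pH", "CuSO4 added", "Pulp level", "Air flow rate"]
pearson = pd.DataFrame(
    [
        [1.000000, 0.163943, 0.438234, 0.036742, 0.026665],
        [0.163943, 1.000000, 0.188979, 0.088289, 0.040845],
        [0.438234, 0.188979, 1.000000, -0.013927, 0.008808],
        [0.036742, 0.088289, -0.013927, 1.000000, 0.158203],
        [0.026665, 0.040845, 0.008808, 0.158203, 1.000000],
    ],
    index=names,
    columns=names,
)


def strongest_pair(corr):
    """Return (name_a, name_b, coefficient) for the largest absolute
    off-diagonal entry of a correlation matrix."""
    # Keep only the strict upper triangle: this drops the diagonal of ones
    # and makes each pair of variables appear exactly once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    name_a, name_b = pairs.abs().idxmax()
    return name_a, name_b, float(pairs.loc[(name_a, name_b)])


print(strongest_pair(pearson))

# Toy data, not taken from the dataset: y grows monotonically with x,
# but not linearly, so the two coefficients disagree.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})
pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]
print(pearson_xy, spearman_xy)
```

On the real data, fcdata.corr(method='spearman') would return a rank-based matrix with the same layout, which is less sensitive to outliers and to non-linear but monotonic relationships.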
In this entry, we have seen a very simple application of what we have seen in previous entries about Pandas. In the next entry we will see yet another interesting application.