A very important aspect in data given in time series (such as the dataset used in the time series correlation entry) are trends. Trends indicate a slow change in the behavior of a variable in time, in its average over a long period.
Although consecutive measurements may increase or decrease on an opposed direction, the overall behavior persists over time. An example of this is climate; in the Northern Hemisphere, temperatures might decrease by several degrees in May from one day to the next day, but the overall behavior is warming towards the summer. Understanding trends also enables predictive analysis; hence the warnings of global warming. But it’s not all about natural sciences; trends are very important in markets and business.
In this entry, we will use the same dataset used in the last entry to detect trends in the data.
First, we will load our dataset:
>>> import pandas as pd >>> import matplotlib.pyplot as plt >>> fcdata = pd.read_csv('flotation-cell.csv', index_col=0) >>> print(fcdata.head()) Feed rate Upstream pH CuSO4 added Pulp level \ Date and time 15/12/2004 19:57:01 341.049347 10.820513 7.995605 24.443470 15/12/2004 19:57:31 274.270782 10.827351 7.786569 27.819294 15/12/2004 19:58:01 334.836761 10.854701 7.655922 30.335533 15/12/2004 19:58:32 323.605927 10.885470 7.838828 30.663738 15/12/2004 19:59:03 322.341309 10.851282 7.995605 30.288647 Air flow rate Date and time 15/12/2004 19:57:01 2.802198 15/12/2004 19:57:31 2.798535 15/12/2004 19:58:01 2.805861 15/12/2004 19:58:32 2.802198 15/12/2004 19:59:03 2.805861
Next, we will select the data we want to adjust:
>>> selected = fcdata.loc[('16/12/2004 20:16:00' < fcdata.index) & (fcdata.index < '16/12/2004 20:25:00'),'Feed rate'] >>> selected.plot() >>> plt.show()
So, how can we use Pandas to find trends in this series? Well, there are many ways, but we will be using an additional library (actually a library used by Pandas in its core): NumPy. NumPy has a lot of interesting mathematical functions, and you might want to have a serious look on it. In this case, we will be using a polynomial fitting function, polyfit() to find the best adjusting first degree polynomial and its fitting error. A first degree polynomial has two parameters, that we can call the slope and the offset. If the slope is different from zero, then there is a trend in the data.polyfit() returns several values (see the documentation, but we only need two of them: the coefficients (an array containing the slope and the offset), and the residuals (that gives us a measure of the fitting error and that we will convert into the Normalized Mean Squared Error (NRMSE), that gives us a measurement of the error between 0 and 1).
>>> import numpy as np >>> coefficients, residuals, _, _, _ = np.polyfit(range(len(selected.index)),selected,1,full=True) >>> mse = residuals/(len(selected.index)) >>> nrmse = np.sqrt(mse)/(selected.max() - selected.min()) >>> print('Slope ' + str(coefficients)) >>> print('NRMSE: ' + str(nrmse)) Slope -1.72979024566 NMRSE: 0.274160734073
We can see that a negative trend is detected, of about -1.73 units of the Feed rate per time interval (in this dataset, the time interval is 30 seconds), with an error of about 27.4%. We can now draw the trend line. Remember that the trend line is a polynomial in the Ax+B form:
>>> plt.plot(selected) >>> plt.plot([coefficients*x + coefficients for x in range(len(selected))]) >>> plt.show()
We can see that the trend line approaches clearly the trend of the series, although the noise (the small local variations among one sample and the next) adds some error.
In this entry we have seen another application of the Pandas library. In the future, we will keep getting into more details.
4 thoughts on “Analyzing trends in data with Pandas”
Hey, thank you for a pedagogical post. I have a dataset of Keywords (from academical journals) that I would like to analyze. It is, indeed, a quite big dataset and I need to do some pre-processing (e.g., categorize or group the keywords). My question, however, is if you think it is possible to do a trends analysis, using Pandas and Python, on the keywords?
Maybe I should look for another way? I have been tinkering around with the data using R but I like Python and want to learn more.
Again, thank you for a great post.
thank you for your comment and sorry for the late reply!
To your question, I would definitely answer that yes. Yo need first to get your data right and clean, and once you have it, you can aggregate keyword counts per year. I suppose that your dataset has one entry per paper, with one column being the keywords and another being the year (among others). If that is the case, and let’s suppose you want the trends for one specific keyword, you can use an apply() method on the keyword column to see if it has your word. This will give you a Series of True/False values with one entry per article. You can then group by year and for each year count the Trues. This will give you the number of occurrences per year of that keyword. It would go something like this:
yearly_counts = pd.DataFrame(columns=['python'], index=mydata['year'].unique()) # we create an entry for the count of each keyword and index it by available year
mydata['occurrences_of_python'] = mydata['keywords'].apply(lambda t: 'python' in t)
for year, mydata_yearly in mydata.groupby('year'): # This creates one dataset per year
cnt = mydata_yearly['occurrences_of_python'].value_counts()[True] # Count the places where it was found
yearly_counts.loc[year,'python'] = cnt
This is the basic idea. I’m sure there is an easier way to do it, but right now this is the best I can do. You can extend this code so that it first finds all the keywords and then does that process per keyword. In that case make sure to override the ‘occurrences_of_python’ column (name it differently) so that your data structure doesn’t get that big.
If performance is an issue, you could also process row by row, parse the keywords and add a counter for each one. This would go like this:
yearly_counts = pd.DataFrame(index=mydata['year'].unique())
for i in mydata.index:
year = mydata.loc[i, 'year']
for w in mydata.loc[i, 'keywords'].split(' '): #Here I'm supposing your keywords are separated by a whitespace. If not, make sure to change the separator and deal with quotes if any
if w in yearly_counts.columns and not pd.isnull(yearly_counts.loc[year, w]):
yearly_counts.loc[year,w] += 1
yearly_counts.loc[year,w] = 1
You can then analyze trends using years and count of each keyword. I hope I helped you. Again, sorry for the late reply, this is in fact my first real comment and I did not see it among all the spam!
If you need more information don’t hesitate to contact me.
Thank you for your reply. No worries about replying late. I have got plenty of other stuff to do (pursuing a Ph.D. in Psychology for instance). I will see if I can figure out a way to use my data according to your pointers. If I do, I will get back to you.
For me plt. plot(selected.values) work, before it I was getting an error that my dataframe does not have datetime .