Analyzing trends in data with Pandas

Standard

A very important aspect in data given in time series (such as the dataset used in the time series correlation entry) are trends. Trends indicate a slow change in the behavior of a variable in time, in its average over a long period.

Although consecutive measurements may increase or decrease on an opposed direction, the overall behavior persists over time. An example of this is climate; in the Northern Hemisphere, temperatures might decrease by several degrees in May from one day to the next day, but the overall behavior is warming towards the summer. Understanding trends also enables predictive analysis; hence the warnings of global warming. But it’s not all about natural sciences; trends are very important in markets and business.
In this entry, we will use the same dataset used in the last entry to detect trends in the data.
First, we will load our dataset:

>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> fcdata = pd.read_csv('flotation-cell.csv', index_col=0)
>>> print(fcdata.head())
                      Feed rate  Upstream pH  CuSO4 added  Pulp level  \
Date and time                                                           
15/12/2004 19:57:01  341.049347    10.820513     7.995605   24.443470   
15/12/2004 19:57:31  274.270782    10.827351     7.786569   27.819294   
15/12/2004 19:58:01  334.836761    10.854701     7.655922   30.335533   
15/12/2004 19:58:32  323.605927    10.885470     7.838828   30.663738   
15/12/2004 19:59:03  322.341309    10.851282     7.995605   30.288647   

                     Air flow rate  
Date and time                       
15/12/2004 19:57:01       2.802198  
15/12/2004 19:57:31       2.798535  
15/12/2004 19:58:01       2.805861  
15/12/2004 19:58:32       2.802198  
15/12/2004 19:59:03       2.805861

Next, we will select the data we want to adjust:

>>> selected = fcdata.loc[('16/12/2004 20:16:00' < fcdata.index) & (fcdata.index < '16/12/2004 20:25:00'),'Feed rate']
>>> selected.plot()
>>> plt.show()
Original time series

Original time series

So, how can we use Pandas to find trends in this series? Well, there are many ways, but we will be using an additional library (actually a library used by Pandas in its core): NumPy. NumPy has a lot of interesting mathematical functions, and you might want to have a serious look on it. In this case, we will be using a polynomial fitting function, polyfit() to find the best adjusting first degree polynomial and its fitting error. A first degree polynomial has two parameters, that we can call the slope and the offset. If the slope is different from zero, then there is a trend in the data.polyfit() returns several values (see the documentation, but we only need two of them: the coefficients (an array containing the slope and the offset), and the residuals (that gives us a measure of the fitting error and that we will convert into the Normalized Mean Squared Error (NRMSE), that gives us a measurement of the error between 0 and 1).

>>> import numpy as np
>>> coefficients, residuals, _, _, _ = np.polyfit(range(len(selected.index)),selected,1,full=True)
>>> mse = residuals[0]/(len(selected.index))
>>> nrmse = np.sqrt(mse)/(selected.max() - selected.min())
>>> print('Slope ' + str(coefficients[0]))
>>> print('NRMSE: ' + str(nrmse))

Slope -1.72979024566
NMRSE: 0.274160734073

We can see that a negative trend is detected, of about -1.73 units of the Feed rate per time interval (in this dataset, the time interval is 30 seconds), with an error of about 27.4%. We can now draw the trend line. Remember that the trend line is a polynomial in the Ax+B form:

>>> plt.plot(selected)
>>> plt.plot([coefficients[0]*x + coefficients[1] for x in range(len(selected))])
>>> plt.show()
Data with the trend line

Data with the trend line

We can see that the trend line approaches clearly the trend of the series, although the noise (the small local variations among one sample and the next) adds some error.

In this entry we have seen another application of the Pandas library. In the future, we will keep getting into more details.

3 thoughts on “Analyzing trends in data with Pandas

  1. Hey, thank you for a pedagogical post. I have a dataset of Keywords (from academical journals) that I would like to analyze. It is, indeed, a quite big dataset and I need to do some pre-processing (e.g., categorize or group the keywords). My question, however, is if you think it is possible to do a trends analysis, using Pandas and Python, on the keywords?
    Maybe I should look for another way? I have been tinkering around with the data using R but I like Python and want to learn more.

    Again, thank you for a great post.

    • Emil

      Hello Erik,
      thank you for your comment and sorry for the late reply!
      To your question, I would definitely answer that yes. Yo need first to get your data right and clean, and once you have it, you can aggregate keyword counts per year. I suppose that your dataset has one entry per paper, with one column being the keywords and another being the year (among others). If that is the case, and let’s suppose you want the trends for one specific keyword, you can use an apply() method on the keyword column to see if it has your word. This will give you a Series of True/False values with one entry per article. You can then group by year and for each year count the Trues. This will give you the number of occurrences per year of that keyword. It would go something like this:

      yearly_counts = pd.DataFrame(columns=['python'], index=mydata['year'].unique()) # we create an entry for the count of each keyword and index it by available year
      mydata['occurrences_of_python'] = mydata['keywords'].apply(lambda t: 'python' in t)
      for year, mydata_yearly in mydata.groupby('year'): # This creates one dataset per year
      cnt = mydata_yearly['occurrences_of_python'].value_counts()[True] # Count the places where it was found
      yearly_counts.loc[year,'python'] = cnt

      This is the basic idea. I’m sure there is an easier way to do it, but right now this is the best I can do. You can extend this code so that it first finds all the keywords and then does that process per keyword. In that case make sure to override the ‘occurrences_of_python’ column (name it differently) so that your data structure doesn’t get that big.
      If performance is an issue, you could also process row by row, parse the keywords and add a counter for each one. This would go like this:

      yearly_counts = pd.DataFrame(index=mydata['year'].unique())
      for i in mydata.index:
      year = mydata.loc[i, 'year']
      for w in mydata.loc[i, 'keywords'].split(' '): #Here I'm supposing your keywords are separated by a whitespace. If not, make sure to change the separator and deal with quotes if any
      if w in yearly_counts.columns and not pd.isnull(yearly_counts.loc[year, w]):
      yearly_counts.loc[year,w] += 1
      else:
      yearly_counts.loc[year,w] = 1

      You can then analyze trends using years and count of each keyword. I hope I helped you. Again, sorry for the late reply, this is in fact my first real comment and I did not see it among all the spam!
      If you need more information don’t hesitate to contact me.

      • Thank you for your reply. No worries about replying late. I have got plenty of other stuff to do (pursuing a Ph.D. in Psychology for instance). I will see if I can figure out a way to use my data according to your pointers. If I do, I will get back to you.

Leave a Reply

Your email address will not be published. Required fields are marked *