Pandas DataFrame

In the previous entry, I introduced the Pandas Series and compared it with a column of an Excel workbook. Following that analogy, a DataFrame is the full Excel workbook, where each column is, you guessed it, a Series.
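
To make the analogy concrete, here is a minimal sketch that builds a DataFrame from two Series and checks that pulling a column back out gives a Series again. The column names and values are made up for illustration; they do not come from the examples in this entry.

```python
import pandas as pd

# Two hypothetical Series that share the same index
names = pd.Series(['Ann', 'Bob', 'Cid'], index=[0, 1, 2])
ages = pd.Series([31, 45, 27], index=[0, 1, 2])

# Each dict key becomes a column name; each Series becomes a column
people = pd.DataFrame({'name': names, 'age': ages})

# Selecting one column gives back a Series with the DataFrame's index
col = people.loc[:, 'age']
print(type(col).__name__)               # Series
print(col.index.equals(people.index))   # True
```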

Structure of a DataFrame

Just like in an Excel workbook, all columns share the same index, and there is a list of columns, each with a distinct name:

>>> import pandas as pd
>>> idx = range(5)
>>> cols = ['A','B','C']
>>> d = pd.DataFrame(index=idx, columns=cols)
>>> print(d)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

Since there is no data in the DataFrame, all entries show as NaN (Not a Number). We will fill it with data to show the ways we can address the contents of the DataFrame. For instance, we can set individual elements:

>>> d.loc[0,'A'] = 'Hello'
>>> print(d)
       A    B    C
0  Hello  NaN  NaN
1    NaN  NaN  NaN
2    NaN  NaN  NaN
3    NaN  NaN  NaN
4    NaN  NaN  NaN

There are other ways of doing this, but I prefer to always do it the same way to avoid confusion. That is why, in the entry about Series, I recommended using Series.loc[] to access the elements.
We can also set a full column, or a range of values:

>>> d.loc[:,'B'] = 'Column B' # Sets the value of all the entries in the B column
>>> d.loc[1:4,'A'] = 'A' # Sets the value of the A column from index 1 to 4
>>> print(d)
       A         B    C
0  Hello  Column B  NaN
1      A  Column B  NaN
2      A  Column B  NaN
3      A  Column B  NaN
4      A  Column B  NaN

A bare : works as in normal Python slices and selects everything. In short, to select the whole column, use : as the index range; to select only a subset, use [start]:[end]. Note that .loc slices by label, so, unlike in ordinary Python slices, [end] is included in the selection.
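
The difference in how the end of a range is treated is easy to get wrong, so here is a small sketch comparing the three cases. The frame `s` is a throwaway example, and `.iloc` (position-based indexing) is not used elsewhere in this entry; it appears here only for contrast.

```python
import pandas as pd

s = pd.DataFrame({'A': list('abcde')}, index=range(5))

by_label = s.loc[1:3, 'A']        # label-based: 1, 2 and 3 are all included
by_position = s.iloc[1:3, 0]      # position-based: stops before position 3
plain_list = list('abcde')[1:3]   # ordinary Python slice, also excludes the end

print(list(by_label))      # ['b', 'c', 'd']
print(list(by_position))   # ['b', 'c']
print(plain_list)          # ['b', 'c']
```
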
We can also address a whole row:

>>> d.loc[1,:] = 'NO INFO'
>>> print(d)
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO
2        A  Column B      NaN
3        A  Column B      NaN
4        A  Column B      NaN

The DataFrame.loc[] operator returns a Series when one of the two fields is a single label, or a DataFrame when both the index and column fields are ranges. So all the operations that were introduced for Series can be executed on the result of the call:

>>> d.loc[:,'A'].apply(len)
0    5
1    7
2    1
3    1
4    1
Name: A, dtype: int64

This example applies the len function to each string of column A. It returns a Series with the length of each entry.
apply() can also be called on a DataFrame. In that case, instead of working elementwise, the function receives each column (axis=0, the default) or each row (axis=1) as a Series:

>>> d.apply(len)
A    5
B    5
C    5
dtype: int64
>>> d.apply(len, axis=1)
0    3
1    3
2    3
3    3
4    3
dtype: int64
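
With len the two axes only differ in the count they return, so a numeric sketch may show the effect of the axis parameter more clearly. The frame `nums` and its values are invented for this illustration.

```python
import pandas as pd

nums = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 (the default): the function is called once per column
per_column = nums.apply(sum)

# axis=1: the function is called once per row
per_row = nums.apply(sum, axis=1)

print(per_column.tolist())   # [6, 60]
print(per_row.tolist())      # [11, 22, 33]
```

The result of the column-wise call is indexed by column name, and the result of the row-wise call is indexed by the original row index.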

To apply a function elementwise, another method is used: DataFrame.applymap(). (Since pandas 2.1 this method is called DataFrame.map(), and applymap() is deprecated.)

>>> d.applymap(type)
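               A              B                C
0  <class 'str'>  <class 'str'>  <class 'float'>
1  <class 'str'>  <class 'str'>    <class 'str'>
2  <class 'str'>  <class 'str'>  <class 'float'>
3  <class 'str'>  <class 'str'>  <class 'float'>
4  <class 'str'>  <class 'str'>  <class 'float'>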
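
Because the elementwise method was renamed in pandas 2.1 (applymap() became map(), and the old name is deprecated), code that has to run on several pandas versions can pick whichever is available. The following is only a sketch of one way to do that; the frame `frame` and the variable `elementwise` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['Hello', 'A'], 'B': ['Column B', 'NO INFO']})

# Use DataFrame.map when the installed pandas provides it (2.1 and later),
# otherwise fall back to the older DataFrame.applymap
elementwise = getattr(frame, 'map', None) or frame.applymap

# The function is called on every cell, one at a time
lengths = elementwise(len)
print(lengths.values.tolist())   # [[5, 8], [1, 7]]
```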

Similarly to what happened in Series, we can filter columns/rows according to certain conditions. For instance, we can choose the rows that satisfy a certain condition for column A:

>>> d.loc[d.loc[:,'A']!='A',:]
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO

Here, d.loc[:,'A']!='A' is used for indexing. The result of this expression on its own would be:

>>> d.loc[:,'A']!='A'
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

The result is a Series that acts as a mask in the index of d, indicating the rows that match the condition.
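
Since the mask is an ordinary boolean Series, several conditions can be combined before indexing. Below is a minimal sketch using the & (and) and | (or) operators. The frame `d2` rebuilds columns A and B of the example by hand so the block runs on its own, and the mask names are illustrative.

```python
import pandas as pd

d2 = pd.DataFrame({
    'A': ['Hello', 'NO INFO', 'A', 'A', 'A'],
    'B': ['Column B', 'NO INFO', 'Column B', 'Column B', 'Column B'],
})

not_a = d2.loc[:, 'A'] != 'A'          # True for rows 0 and 1
has_b = d2.loc[:, 'B'] == 'Column B'   # True for every row except 1

both = d2.loc[not_a & has_b, :]     # rows where both conditions hold
either = d2.loc[not_a | has_b, :]   # rows where at least one holds

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 1, 2, 3, 4]
```

Note that Python's `and` and `or` keywords do not work on Series; the elementwise operators are needed.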

In this entry we have seen how to create a DataFrame and access its contents. Of course, there is much more to say about DataFrames, especially on the mathematical side, and I will use them in future entries of this blog.
