In the previous entry, I introduced Pandas Series and compared a Series with a column of an Excel workbook. Following that analogy, a DataFrame is the full Excel workbook, where each column is … you guessed it: a Series.

Just like in an Excel workbook, all columns share the same index, and there is a list of columns. Each column has a distinct name:

>>> import pandas as pd
>>> idx = range(5)
>>> cols = ['A','B','C']
>>> d = pd.DataFrame(index=idx, columns=cols)
>>> print(d)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
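To make the "each column is a Series" idea concrete, here is a minimal sketch. The sample data and the names `df` and `col_a` are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["x", "y", "z"]},
    index=["r0", "r1", "r2"],
)

col_a = df.loc[:, "A"]                # selecting one column yields a Series
print(type(col_a).__name__)           # Series
print(col_a.index.equals(df.index))   # True: the column shares the DataFrame's index
print(list(df.columns))               # ['A', 'B']
```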

Since there is no data in the DataFrame, all entries show as NaN (Not a Number). We will fill it with data to show the ways we can address the contents of the DataFrame. For instance, we can access individual elements:

>>> d.loc[0,'A'] = 'Hello'
>>> print(d)
       A    B    C
0  Hello  NaN  NaN
1    NaN  NaN  NaN
2    NaN  NaN  NaN
3    NaN  NaN  NaN
4    NaN  NaN  NaN

There are other methods for doing this, but I prefer to always do it in the same manner, to avoid confusion. That’s why in the entry about Series I recommended using Series.loc[] to access the elements.
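For reference, a short sketch of some of those other accessors. The choice of `.at`, `.iloc` and `.iat` here is mine; the post does not say which alternatives it has in mind. All four lines read the same cell:

```python
import pandas as pd

d = pd.DataFrame(index=range(5), columns=["A", "B", "C"])
d.loc[0, "A"] = "Hello"

by_label = d.loc[0, "A"]          # label-based, the style used in this post
by_label_scalar = d.at[0, "A"]    # label-based, single cells only
by_position = d.iloc[0, 0]        # position-based: first row, first column
by_position_scalar = d.iat[0, 0]  # position-based, single cells only

print(by_label, by_label_scalar, by_position, by_position_scalar)
```

Sticking to one accessor, as recommended above, avoids mixing up labels and positions when the index is not a simple 0..n range.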

We can also set a full column, or a range of values:

>>> d.loc[:,'B'] = 'Column B' # Sets the value of all the entries in the B column
>>> d.loc[1:4,'A'] = 'A' # Sets the value of the A column from index 1 to 4
>>> print(d)
       A         B    C
0  Hello  Column B  NaN
1      A  Column B  NaN
2      A  Column B  NaN
3      A  Column B  NaN
4      A  Column B  NaN

The : works much like in normal Python slices, with one difference. To select the whole column, use : as the index range; to select only a subset, use [start]:[end]. Note that, unlike in Python slices, [end] is included in the selection.
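The difference in how the end of the range is treated is easy to check side by side. This is a small sketch with made-up data (the names `values` and `frame` are only for illustration), comparing a plain list slice, a label-based .loc[] slice and a position-based .iloc[] slice:

```python
import pandas as pd

values = list(range(5))
frame = pd.DataFrame({"A": values}, index=range(5))

python_slice = values[1:4]                  # plain Python: end is excluded
loc_slice = frame.loc[1:4, "A"].tolist()    # label-based: end label is included
iloc_slice = frame.iloc[1:4, 0].tolist()    # position-based: end is excluded

print(python_slice)  # [1, 2, 3]
print(loc_slice)     # [1, 2, 3, 4]
print(iloc_slice)    # [1, 2, 3]
```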

We can also address a whole row:

>>> d.loc[1,:] = 'NO INFO'
>>> print(d)
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO
2        A  Column B      NaN
3        A  Column B      NaN
4        A  Column B      NaN

When we use the DataFrame.loc[] operator, it gives us a handle to a Series, or to a DataFrame if both the index and column fields were set to a range. This means that all the operations that were introduced for Series can be executed over the result of the call:
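As a quick illustration of which type comes back in each case, here is a sketch on a small, made-up DataFrame (two rows only, to keep it short):

```python
import pandas as pd

d = pd.DataFrame({"A": ["Hello", "A"], "B": ["Column B", "Column B"]})

cell = d.loc[0, "A"]         # single label for both fields: a plain value
column = d.loc[:, "A"]       # range of rows, one column: a Series
row = d.loc[0, :]            # one row, range of columns: a Series
block = d.loc[0:1, "A":"B"]  # ranges for both fields: a DataFrame

for name, obj in [("cell", cell), ("column", column), ("row", row), ("block", block)]:
    print(name, type(obj).__name__)
```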

>>> d.loc[:,'A'].apply(len)
0    5
1    7
2    1
3    1
4    1
Name: A, dtype: int64

This example applies the len function to each string of column A. It returns a Series with the length of each entry.

The apply() function can be applied over a DataFrame. Instead of working elementwise, the function is applied over each column or row (depending on the axis parameter):

>>> d.apply(len)
A    5
B    5
C    5
dtype: int64
>>> d.apply(len, axis=1)
0    3
1    3
2    3
3    3
4    3
dtype: int64
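The effect of the axis parameter is perhaps easier to see with numbers. In this sketch (made-up data, and a hypothetical helper called `value_range`), the function receives a whole column or a whole row as a Series:

```python
import pandas as pd

nums = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def value_range(series):
    # Receives a full column (axis=0) or a full row (axis=1) as a Series.
    return series.max() - series.min()

per_column = nums.apply(value_range)        # default axis=0: one result per column
per_row = nums.apply(value_range, axis=1)   # axis=1: one result per row

print(per_column.tolist())  # [2, 20]
print(per_row.tolist())     # [9, 18, 27]
```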

To apply a function elementwise, another method is used: DataFrame.applymap() (deprecated since pandas 2.1 in favor of DataFrame.map(), which does the same job):

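>>> d.applymap(type)
               A              B                C
0  <class 'str'>  <class 'str'>  <class 'float'>
1  <class 'str'>  <class 'str'>    <class 'str'>
2  <class 'str'>  <class 'str'>  <class 'float'>
3  <class 'str'>  <class 'str'>  <class 'float'>
4  <class 'str'>  <class 'str'>  <class 'float'>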
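If you would rather not depend on which elementwise method your pandas version offers, one possible alternative (my suggestion, not something from the original example) is to combine the two tools already seen: DataFrame.apply() to visit each column, and Series.map() to visit each element of that column. A sketch on a reduced, made-up version of the data:

```python
import pandas as pd

# dtype=object keeps the values exactly as given (strings and a float NaN).
d = pd.DataFrame(
    {"A": ["Hello", "NO INFO"], "C": [float("nan"), "NO INFO"]},
    dtype=object,
)

# apply() hands over each column as a Series; map() then works element by element.
types = d.apply(lambda col: col.map(type))
print(types)
```

Note that NaN shows up as a float, which is why an "empty" cell and a string cell report different types.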

Similarly to what happened in Series, we can filter columns/rows according to certain conditions. For instance, we can choose the rows that satisfy a certain condition for column A:

>>> d.loc[d.loc[:,'A']!='A',:]
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO

Here, d.loc[:,'A']!='A' is used for indexing. The result of this expression on its own would be:

>>> d.loc[:,'A']!='A'
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

The result is a Series that acts as a mask in the index of d, indicating the rows that match the condition.
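Masks can also be stored in a variable and combined. The sketch below rebuilds columns A and B of the example by hand (so it runs on its own) and adds a second condition, which is my own extension of the example. It relies on the pandas operators & (and), | (or) and ~ (not), with each condition wrapped in parentheses:

```python
import pandas as pd

d = pd.DataFrame({
    "A": ["Hello", "NO INFO", "A", "A", "A"],
    "B": ["Column B", "NO INFO", "Column B", "Column B", "Column B"],
})

mask = d.loc[:, "A"] != "A"   # boolean Series, one entry per row
selected = d.loc[mask, :]     # keeps only the rows where the mask is True

# Rows where A is not 'A' and B is still 'Column B'.
both = d.loc[(d.loc[:, "A"] != "A") & (d.loc[:, "B"] == "Column B"), :]

print(mask.tolist())            # [True, True, False, False, False]
print(selected.index.tolist())  # [0, 1]
print(both.index.tolist())      # [0]
```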

In this entry we have seen how to create a DataFrame and access its contents. Of course, there is much more to say about DataFrames, especially on the mathematical side, but I will leave that for future entries in this blog.