In the previous entry, I introduced the Pandas Series and compared it with a column of an Excel workbook. Following that analogy, a DataFrame is the full Excel workbook, where each column is … you guessed it: a Series.
Just like in an Excel workbook, all columns share the same index, and there is a list of columns. Each column will have a distinctive name:
>>> import pandas as pd
>>> idx = range(5)
>>> cols = ['A','B','C']
>>> d = pd.DataFrame(index=idx, columns=cols)
>>> print(d)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
Since there is no data in the DataFrame, all entries show as NaN (Not a Number). We will fill it with data to show the ways we can address the contents of the DataFrame. For instance, we can access individual elements:
>>> d.loc[0,'A'] = 'Hello'
>>> print(d)
       A    B    C
0  Hello  NaN  NaN
1    NaN  NaN  NaN
2    NaN  NaN  NaN
3    NaN  NaN  NaN
4    NaN  NaN  NaN
There are other ways of doing this, but I prefer to always do it the same way, to avoid confusion. That's why, in the entry about Series, I recommended using Series.loc[] to access the elements.
We can also set a full column, or a range of values:
>>> d.loc[:,'B'] = 'Column B'  # Sets the value of all the entries in the B column
>>> d.loc[1:4,'A'] = 'A'       # Sets the value of the A column from index 1 to 4
>>> print(d)
       A         B    C
0  Hello  Column B  NaN
1      A  Column B  NaN
2      A  Column B  NaN
3      A  Column B  NaN
4      A  Column B  NaN
The : works much like it does in normal Python slices. To summarize: if you want to select the whole column, use : as the index range; if you only want a subset, use [start]:[end]. Note that, unlike in Python slices, [end] is included in the selection.
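To make the difference with plain Python slices concrete, here is a small sketch comparing a list slice with a .loc slice. The names letters and df are only illustrative and are not part of the running example:

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({'A': letters}, index=range(5))

# Plain Python slice: the end position (4) is excluded
list_slice = letters[1:4]

# Label-based .loc slice: the end label (4) is included
loc_slice = df.loc[1:4, 'A']

print(list_slice)          # three elements
print(loc_slice.tolist())  # four elements
```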
We can also address a whole row:
>>> d.loc[1,:] = 'NO INFO'
>>> print(d)
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO
2        A  Column B      NaN
3        A  Column B      NaN
4        A  Column B      NaN
When we use the DataFrame.loc[] operator with a range in the index or column field, it returns a Series, or a DataFrame if both fields were set to a range. All the operations that were introduced for Series can therefore be executed over the result of the call:
>>> d.loc[:,'A'].apply(len)
0    5
1    7
2    1
3    1
4    1
Name: A, dtype: int64
This example applies the len function to each string of column A. It returns a Series with the length of each entry.
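As a quick check of what .loc returns in each case, the following sketch prints the type of the result for the different combinations of single labels and ranges. It uses a small illustrative DataFrame (df is an assumed name, not the d from above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': ['u', 'v', 'w']})

single = df.loc[0, 'A']       # one row label, one column label: a scalar
column = df.loc[:, 'A']       # range of rows, one column: a Series
row = df.loc[0, :]            # one row, range of columns: a Series
block = df.loc[0:1, 'A':'B']  # ranges in both fields: a DataFrame

for name, obj in [('single', single), ('column', column),
                  ('row', row), ('block', block)]:
    print(name, type(obj).__name__)
```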
The apply() method can also be called on a DataFrame. Instead of working elementwise, the function is applied to each column or row (depending on the axis parameter):
>>> d.apply(len)
A    5
B    5
C    5
dtype: int64
>>> d.apply(len, axis=1)
0    3
1    3
2    3
3    3
4    3
dtype: int64
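The effect of the axis parameter is perhaps easier to see with numbers. In this sketch, value_range is a hypothetical helper written for illustration and df is a made-up numeric DataFrame. The function receives a whole column (default, axis=0) or a whole row (axis=1) as a Series:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def value_range(series):
    """Difference between the largest and smallest value of a Series."""
    return series.max() - series.min()

per_column = df.apply(value_range)       # one result per column
per_row = df.apply(value_range, axis=1)  # one result per row

print(per_column)
print(per_row)
```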
In order to apply a function elementwise, another method is used: DataFrame.applymap():
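>>> d.applymap(type)
               A              B                C
0  <class 'str'>  <class 'str'>  <class 'float'>
1  <class 'str'>  <class 'str'>    <class 'str'>
2  <class 'str'>  <class 'str'>  <class 'float'>
3  <class 'str'>  <class 'str'>  <class 'float'>
4  <class 'str'>  <class 'str'>  <class 'float'>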
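One caveat: in pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which does the same elementwise job. A minimal sketch that works with either version, assuming a small illustrative DataFrame named df, could look like this:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Hello', 'A'], 'B': ['Column B', 'NO INFO']})

# DataFrame.map exists in pandas 2.1 and later;
# older versions only provide applymap
if hasattr(df, 'map'):
    lengths = df.map(len)
else:
    lengths = df.applymap(len)

print(lengths)
```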
Similarly to what happened in Series, we can filter columns/rows according to certain conditions. For instance, we can choose the rows that satisfy a certain condition for column A:
>>> d.loc[d.loc[:,'A']!='A',:]
         A         B        C
0    Hello  Column B      NaN
1  NO INFO   NO INFO  NO INFO
Here, d.loc[:,'A']!='A' is used for indexing. The result of this expression on its own would be:
>>> d.loc[:,'A']!='A'
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool
The result is a Series that acts as a mask in the index of d, indicating the rows that match the condition.
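Since the mask is just a boolean Series, masks can be combined with & (and), | (or) and ~ (not) before being passed to .loc. The following is a sketch over an illustrative DataFrame (df, mask_a and mask_b are assumed names), not part of the running example:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Hello', 'NO INFO', 'A', 'A'],
                   'B': ['Column B', 'NO INFO', 'Column B', 'Column B']})

mask_a = df.loc[:, 'A'] != 'A'         # rows where A is not 'A'
mask_b = df.loc[:, 'B'] == 'Column B'  # rows where B is 'Column B'

both = df.loc[mask_a & mask_b, :]      # rows matching both conditions
either = df.loc[mask_a | mask_b, :]    # rows matching at least one
negated = df.loc[~mask_a, :]           # rows where A is 'A'

print(both)
```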
In this entry we have seen how to create a DataFrame and access its contents. Of course, there is much more to say about DataFrames, especially for mathematical work, but I will get to that as I use them in future entries in this blog.