In this entry, I will give a brief introduction to Pandas, a library that I have been using for the past year for my data analysis needs.
Pandas brings the simplicity and elegance of Python to data analysis. It is part of the SciPy ecosystem, which includes other libraries for scientific computation.
There are two main data structures in Pandas:
- Series: indexed one-dimensional array.
- DataFrame: collection of several Series that share a common index. Each Series is a column in the DataFrame, while a row is made up of all entries sharing the same index value.
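To make the two structures more concrete, here is a minimal sketch. The weather values, the day labels, and the variable names are all made up for illustration.

```python
import pandas as pd

# A Series: a one-dimensional array whose values are labelled by an index.
# The labels and numbers are invented for this example.
temperature = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"])
humidity = pd.Series([0.40, 0.35, 0.55], index=["mon", "tue", "wed"])

# A DataFrame: several Series sharing a common index, one per column.
weather = pd.DataFrame({"temperature": temperature, "humidity": humidity})

# Selecting a column gives back a Series.
column = weather["temperature"]

# Selecting an index value gives the row: one entry from each column.
row = weather.loc["tue"]

print(weather)
print(row)
```

Note that the row is itself returned as a Series, indexed by the column names.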
These data structures will be explained further in future entries. Pandas provides functionality to use and manipulate the information stored in them. As a very simple analogy, a DataFrame acts like an Excel spreadsheet.
Working with Pandas revolves around these objects. Once the data has been loaded into an object, it can be manipulated with the available methods, such as Series.mean(), which calculates the average of a Series, or DataFrame.apply(), which applies a function to each column or row of the DataFrame.
Of course, one of the most important operations is accessing individual values of the structures, which can be done with Series.loc[] or DataFrame.loc[]. These indexers make the Pandas data structures very powerful, since they support selecting several values at once as well as conditional selection.
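The methods and indexers mentioned above can be sketched as follows. The scores table and its names are invented for the example, and this is only one of several ways to write each selection.

```python
import pandas as pd

# Hypothetical data: exam scores for three students.
scores = pd.DataFrame(
    {"math": [7.0, 9.0, 4.0], "physics": [6.0, 8.0, 5.5]},
    index=["ana", "ben", "cat"],
)

# Series.mean(): the average of one column.
math_mean = scores["math"].mean()

# DataFrame.apply(): by default the function receives each column...
col_max = scores.apply(lambda col: col.max())
# ...and with axis=1 it receives each row instead.
row_range = scores.apply(lambda row: row.max() - row.min(), axis=1)

# .loc[] with a row label and a column label returns a single value.
single = scores.loc["ben", "physics"]

# .loc[] with lists of labels selects several rows and columns at once.
multiple = scores.loc[["ana", "cat"], ["math"]]

# .loc[] with a boolean condition keeps only the rows where it holds.
passed = scores.loc[scores["math"] >= 5.0]

print(math_mean, single)
print(passed)
```

The conditional selection works because `scores["math"] >= 5.0` is itself a Series of True/False values that shares the index of the DataFrame.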
Pandas rounds out its data manipulation features with a solid set of functions for reading and writing data. It supports many file formats, such as CSV and XLS, and can also read from and write to SQL databases such as MySQL.
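As a small illustration of reading and writing, the sketch below does a CSV round trip. To keep it self-contained, it uses in-memory text buffers where a real script would pass file paths, and the sample data is made up. Excel files and SQL databases are handled by similar functions (pd.read_excel() and pd.read_sql()), which are not shown here because they need extra dependencies.

```python
import io

import pandas as pd

# Made-up CSV content; in practice this would be a file on disk,
# read with something like pd.read_csv("people.csv").
csv_text = "name,age\nana,31\nben,27\n"

# Reading: parse the CSV text into a DataFrame.
people = pd.read_csv(io.StringIO(csv_text))

# Writing: dump the DataFrame back to CSV, without the index column.
buffer = io.StringIO()
people.to_csv(buffer, index=False)
written = buffer.getvalue()

print(people)
print(written)
```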
Finally, Pandas includes some very basic data visualization capabilities, but external libraries will usually do a much better job.
In future entries, I will expand on specific aspects of Pandas, focusing on the functions that I have used the most.