Drawing graphics is a key step in data analysis, not only for presenting final results to other people but also for better understanding the data we have in our hands. As shown in the previous entry, Pandas has a very handy plotting subsystem. Nevertheless, the simple graphics provided by Pandas are sometimes not enough for a clear visualization. Seaborn comes to the rescue here, providing specialized plot types aimed at complex data sets, with an emphasis on clean visualization.
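    # IPython/Jupyter magic: render figures inside the notebook
    %matplotlib inline
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd

    # Load the fuel economy data; the first CSV column holds the row index
    fedata = pd.read_csv('mpg.csv', index_col=0)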
Seaborn lets us explore each variable individually with functions such as kdeplot(), which draws an estimate of a continuous probability density from the data samples:
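    # Density estimates of highway and city mileage on the same axes,
    # labeled so the two curves can be told apart
    sns.kdeplot(fedata['hwy'], label='hwy')
    sns.kdeplot(fedata['cty'], label='cty')
    plt.legend()
    plt.show()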
We can see that, generally speaking, city mileage is lower than highway mileage, since its density is concentrated at lower values. But we may want to know more. For instance, do cars with a higher highway mileage also have a higher city mileage? Or are they optimized for one scenario and score worse in the other? We can find out with bivariate analysis functions such as jointplot(), which plots all the samples in a single graphic together with their univariate distributions.
The resulting graphic shows a strong positive correlation between the two variables. If we buy a car with a high city mileage, it will very likely also have a high highway mileage.
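If we want a number to go with that visual impression, Pandas can compute the Pearson correlation coefficient between the two columns. The sketch below uses a made-up `sample` frame with the same column names, so the value it prints says nothing about the real data; with the actual dataset we would use `fedata` in its place.

```python
import pandas as pd

# Made-up stand-in rows; replace `sample` with the real `fedata` frame.
sample = pd.DataFrame({
    'cty': [18, 21, 20, 16, 14, 11, 26, 15],
    'hwy': [29, 29, 31, 25, 20, 15, 35, 22],
})

# Pearson correlation coefficient between city and highway mileage;
# values close to 1 indicate a strong positive linear relation.
r = sample['cty'].corr(sample['hwy'])
print(f"Pearson r = {r:.3f}")
```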
In the previous entry, we found a relation between the number of cylinders and the average fuel economy: the higher the number of cylinders, the lower the mileage. But that was just an average. With Seaborn, we can easily see more detail. We can get plots split by the values of one column using FacetGrid(). FacetGrid() only provides a grid of axes, one per group of data; to actually draw something, we map() a Matplotlib plotting function onto the resulting grid.
    # One scatter plot of city vs. highway mileage per number of cylinders
    grid = sns.FacetGrid(fedata, col='cyl')
    grid.map(plt.scatter, 'cty', 'hwy')
    plt.show()
We see that, generally speaking, a higher number of cylinders does mean a lower mileage, but that is not an absolute truth. Some 8-cylinder cars have a higher highway and city mileage than some 4-cylinder cars. In fact, there is great variability among the 4-cylinder cars. With a very small modification of the above code, we can see even more detail. Many Seaborn functions accept a hue parameter that colors the data by class. In our case, this is a second level of classification on top of the FacetGrid columns:
    # Same grid, with points colored by vehicle class
    grid = sns.FacetGrid(fedata, col='cyl', hue='class')
    grid.map(plt.scatter, 'cty', 'hwy').add_legend()
    plt.show()
We can now see that, within each cylinder group, some classes generally score better (compacts and subcompacts) whereas others score worse (pickups).
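The same comparison can be cross-checked numerically with a Pandas groupby. The following is only a sketch: the `sample` rows are invented for illustration and merely reuse the column names `cyl`, `class` and `hwy`, so the table they produce does not describe the real data.

```python
import pandas as pd

# Made-up stand-in rows; replace `sample` with the real `fedata` frame.
sample = pd.DataFrame({
    'cyl':   [4, 4, 4, 4, 6, 6, 8, 8],
    'class': ['compact', 'compact', 'subcompact', 'pickup',
              'compact', 'pickup', 'pickup', 'suv'],
    'hwy':   [29, 31, 33, 20, 26, 17, 15, 18],
})

# Mean highway mileage for every (cylinders, class) combination.
# unstack() moves the class level into columns, giving a table with one
# row per cylinder count; combinations with no cars show up as NaN.
summary = sample.groupby(['cyl', 'class'])['hwy'].mean().unstack()
print(summary)
```

Reading the table row by row gives the same per-cylinder comparison between classes that the colored scatter plots show.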
The FacetGrid() can go one step further and classify by a third class dimension using the row parameter:
    # Columns split by cylinders, rows by model year, colors by class
    grid = sns.FacetGrid(fedata, col='cyl', row='year', hue='class')
    grid.map(plt.scatter, 'cty', 'hwy').add_legend()
    plt.show()
Seaborn also includes some more sophisticated functionality. For instance, going back to the idea of finding a relation between the unclassified values of city and highway mileage, we can do better with regplot(), which draws the same scatter plot plus a fitted linear regression line.
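The regplot() call described above might look like this minimal sketch. As before, `sample` is a made-up frame that only shares the column names `cty` and `hwy` with the real data; with the actual dataset we would pass `fedata`.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows; replace `sample` with the real `fedata` frame.
sample = pd.DataFrame({
    'cty': [18, 21, 20, 16, 14, 11, 26, 15],
    'hwy': [29, 29, 31, 25, 20, 15, 35, 22],
})

# Scatter plot of the samples plus a fitted straight line; the shaded
# band around the line is a bootstrapped confidence interval.
ax = sns.regplot(data=sample, x='cty', y='hwy')
```

regplot() returns the Matplotlib axes it drew on, so titles or axis limits can be adjusted afterwards with the usual Matplotlib methods.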
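Seaborn can also plot every pair of numeric columns against each other at once. PairGrid() builds the grid of axes, and map() draws the given plot in each cell:

    # Scatter plot for every pair of numeric columns
    grid = sns.PairGrid(fedata)
    grid.map(plt.scatter)
    plt.show()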
A shorthand for PairGrid() followed by map() is pairplot(), which also makes better use of the diagonal of the grid by drawing the distribution of each variable there.
To further explore the possibilities of Seaborn, some good places to go are the tutorial, the gallery and the API reference. There are many more types of representation that we haven’t even mentioned here. Seaborn is especially powerful when used with Pandas. It can be used without it, but then its usefulness is drastically reduced; it is not a general purpose plotting library.