Drawing better graphics with Seaborn

Drawing graphics is a key step in data analysis, not only for presenting final results to other people but also for better understanding the data we have in our hands. As shown in the previous entry, Pandas has a very handy plotting subsystem. Nevertheless, the simple graphics provided by Pandas are sometimes not enough for a clear visualization. Seaborn comes to the rescue here, providing specialized plot types aimed at complex data sets, with an emphasis on clean visualization.

We will use the same car fuel economy dataset as last time.

In [1]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

fedata = pd.read_csv('mpg.csv', index_col=0)

Seaborn lets us explore each variable individually with functions such as kdeplot(), which draws an estimate of a continuous probability density based on the data samples:

In [2]:
# Label each curve so the two densities can be told apart
sns.kdeplot(fedata['hwy'], label='hwy')
sns.kdeplot(fedata['cty'], label='cty')
plt.legend()
plt.show()
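
To make the idea behind kdeplot() more concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. The function name, the sample values and the fixed bandwidth are made up for illustration; Seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in bandwidth units.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The estimate is the mean of the bumps centred on each sample.
    return bumps.mean(axis=1)

# Hypothetical highway mileage samples (not taken from mpg.csv).
hwy_samples = [20, 24, 26, 26, 27, 29, 31, 36]
grid = np.linspace(0, 60, 601)
density = gaussian_kde_sketch(hwy_samples, grid, bandwidth=2.0)

# A density should integrate to roughly 1 over a wide enough range.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one smooths them out.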

We can see that, generally speaking, city mileage is lower than highway mileage: its density is concentrated at lower values. But we may want to know something more. For instance, do cars that have a higher highway mileage also have a higher city mileage? Or are they optimized for one scenario and score worse in the other? We can find out with bivariate analysis functions such as jointplot(), which plots all samples in a single graphic together with their univariate distributions:

In [3]:
# Recent Seaborn releases require x and y to be passed as keywords
sns.jointplot(x='cty', y='hwy', data=fedata)
plt.show()

This graphic shows a strong positive correlation between the two variables. If we buy a car with a high city mileage, it will very likely also have a high highway mileage.
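
The strength of that relation can also be expressed as a single number. The sketch below computes the Pearson correlation coefficient with Pandas on a small made-up sample that only borrows the column names; in the notebook, the same call on fedata['cty'] and fedata['hwy'] would give the value for the real data.

```python
import pandas as pd

# Hypothetical sample with the same column names as the mileage data.
sample = pd.DataFrame({
    'cty': [14, 16, 18, 21, 24, 28],
    'hwy': [20, 23, 26, 29, 33, 37],
})

# Pearson correlation coefficient: values close to +1 mean the points
# lie close to an increasing straight line.
r = sample['cty'].corr(sample['hwy'])
print(round(r, 3))
```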

In the previous entry, we first found the relation between the number of cylinders and the average fuel economy. It gave us an idea that the higher the number of cylinders, the lower the mileage. But that was just an average. With Seaborn, we can see some more details easily. We can get plots classified by the values of one column using FacetGrid(). FacetGrid() just provides a grid with classified data. We will map() a Matplotlib plot on the resulting grid to actually draw something.

In [4]:
grid = sns.FacetGrid(fedata, col='cyl')
grid.map(plt.scatter,'cty','hwy')
plt.show()

We see that, generally speaking, it is true that a higher number of cylinders means a lower mileage, but that is not an absolute truth. There are some 8-cylinder cars that have a higher highway and city mileage than some 4-cylinder cars. In fact, there is great variability among the 4-cylinder cars. With a very small modification of the above code, we can see even more detail. Many Seaborn functions accept a hue parameter that separates data by class. In our case, this is a second level of classification on top of the one provided by the FacetGrid().

In [5]:
grid = sns.FacetGrid(fedata, col='cyl', hue='class')
grid.map(plt.scatter,'cty','hwy').add_legend()
plt.show()

We can now see that, within each cylinder group, some vehicle classes generally score better (compacts and subcompacts) whereas others score worse (pickups).
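
The same two-level classification can be summarized numerically with a Pandas groupby(). This is only a sketch on hypothetical rows that reuse the column names; in the notebook, `sample` would be replaced by `fedata`.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use `fedata` instead.
sample = pd.DataFrame({
    'cyl':   [4, 4, 4, 4, 8, 8, 8, 8],
    'class': ['compact', 'compact', 'pickup', 'pickup',
              'suv', 'suv', 'pickup', 'pickup'],
    'cty':   [21, 23, 15, 17, 13, 15, 11, 13],
    'hwy':   [29, 31, 19, 21, 17, 19, 15, 17],
})

# Mean mileage for every (cylinders, class) pair, that is, one row per
# combination of FacetGrid column and hue value.
summary = sample.groupby(['cyl', 'class'])[['cty', 'hwy']].mean()
print(summary)
```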

The FacetGrid() can go one step further and classify by a third class dimension using the row parameter:

In [6]:
grid = sns.FacetGrid(fedata, col='cyl', row='year', hue='class')
grid.map(plt.scatter,'cty','hwy').add_legend()
plt.show()
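
With three levels of classification, some panels may end up with very few points or none at all. One way to check this before drawing the grid is to count the rows that fall in each panel with pd.crosstab(). The rows below are hypothetical and only reuse the column names.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use `fedata` instead.
sample = pd.DataFrame({
    'year': [1999, 1999, 1999, 2008, 2008, 2008, 2008],
    'cyl':  [4, 4, 6, 4, 6, 8, 8],
})

# Rows of the table match the grid rows (year) and columns match the
# grid columns (cyl); each cell is the number of cars in that panel.
panel_counts = pd.crosstab(sample['year'], sample['cyl'])
print(panel_counts)
```

A cell with a count of zero corresponds to a panel that will be drawn empty.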

Seaborn also includes more sophisticated functionality. For instance, going back to the idea of finding a relation between the unclassified values of city and highway mileage, we can do better with regplot(). We get the same scatter plot plus a linear regression fit:

In [7]:
# Recent Seaborn releases require x, y and data to be passed as keywords
sns.regplot(x='cty', y='hwy', data=fedata)
plt.show()
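
regplot() draws the regression line but does not return its coefficients. If the numbers are needed, one option is to fit a straight line separately, for example with np.polyfit(). The sketch below uses hypothetical, noise-free values so that the expected slope and intercept are known in advance; it is not how Seaborn computes its fit internally.

```python
import numpy as np

# Hypothetical data following an exact line: hwy = 1.3 * cty + 2.0
cty = np.array([14, 16, 18, 21, 24, 28], dtype=float)
hwy = 1.3 * cty + 2.0

# A degree-1 polynomial fit is an ordinary least-squares straight line.
# polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(cty, hwy, deg=1)
print(round(slope, 3), round(intercept, 3))
```

In the notebook, passing fedata['cty'] and fedata['hwy'] instead would give the coefficients for the real data.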

One last tool that we may often use to get a quick overview of a DataFrame is PairGrid(), which shows the relation between all pairs of variables.

In [8]:
grid = sns.PairGrid(fedata)
grid.map(plt.scatter)
plt.show()

A shorthand for this is pairplot(), which also makes better use of the diagonal of the grid.

In [9]:
sns.pairplot(fedata)
plt.show()

To further explore the possibilities of Seaborn, some good places to go are the tutorial, the gallery and the API reference. There are many more types of representation that we haven’t even mentioned here. Seaborn is especially powerful when used with Pandas. It can be used without it, but then its usefulness is drastically reduced; it is not a general purpose plotting library.
