Loading a DataFrame from an SQL database

Standard

In a prior entry we described how to load data from a CSV file. In this entry, we will do a quick introduction on how to load the contents of an SQL table into a DataFrame.

With that purpose, we will use the DataFrame.read_sql() method.
The first thing that we need is an active MySQL (or similar) database. We need to establish a connection to the database using SQL Alchemy. SQL Alchemy is a very powerful library for managing databases; definitely get to know it better if you have the time. You will also need to install a mysql driver; in this case, we will use PyMySQL. You can install it using pip or the package manager of your choice.

>>> from sqlalchemy import create_engine
>>> engine = create_engine('mysql+pymysql://username:password@hostname:3306/dbname')

Next, with the established connection, we open a table or execute an SQL query with the DataFrame.read_sql() method.

>>> with engine.connect() as conn, conn.begin():
>>>     data = pd.read_sql('mytable', conn)
>>>     print(data)
      User  Age            Email
0    Alice   21    alice@foo.com
1      Bob   11   bob@sponge.com
2  Charlie   32  charles@baz.com

We have used ‘mytable’ to select a whole table, but a specific SQL query can be executed as well:

>>> with engine.connect() as conn, conn.begin():
>>>     data = pd.read_sql('select * from mytable where Age>18', conn)
>>>     print(data)
      User  Age            Email
0    Alice   21    alice@foo.com
1  Charlie   32  charles@baz.com

Once the data is in the DataFrame, it can be manipulated using all the available methods in Pandas.

>>> data['Domain'] = data.apply(lambda x: x['Email'].split('@')[1],axis=1)
>>> print(data)

      User  Age            Email      Domain
0    Alice   21    alice@foo.com     foo.com
1      Bob   11   bob@sponge.com  sponge.com
2  Charlie   32  charles@baz.com     baz.com

The results can be saved again in a database table with DataFrame.to_sql(). There is a limitation in this functionality (and please correct me if I am wrong): it cannot update existing entries a table. It can only save on a new table, override an existing one or append data.

>>> with engine.connect() as conn, conn.begin():
>>>     data.to_sql('mytable_new',conn,index=False)

We have seen the basic functionality for reading and writing SQL databases. With this, we can implement all that SQL queries provide us. Remember that once the results of a query are returned, we won’t have access to the data that is in the table but not in the results (unless we call DataFrame.read_sql() again).

To see all the options, have a look at the documentation of read_sql() and to_sql().

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.