21  Programming with databases

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

Now that we can create tables and query data directly using SQL, it is time to bring these operations into Python programs. By connecting Python to a database, we can combine the flexibility of Python with the power of SQL. This allows us to automate data workflows, build applications that interact with databases, and even perform advanced analyses that go beyond what we could do with SQL alone.

21.1 Connecting to a database and executing SQL statements

To be able to use a database in our Python programs we need to import and use a package that knows how to communicate with a relational database management system (RDBMS) where our database is stored.

In this chapter we will work with an SQLite3 database. Unlike most client-server database systems, SQLite3 is an embedded database engine – it runs within your application and stores a database in a single file on your computer.

We will use the built-in sqlite3 module:

Fortunately, what we will cover in this chapter can be directly applied to databases in other RDBMSs as well, as long as there exist PEP 249-compliant Python packages for those systems. PEP 249 specifies the application programming interface (API) that these database packages must implement – in other words, what functions, classes, and exceptions they must provide.

First, let’s create a copy of the data/sqltutorial.db database so that we can modify the database and have an easy way to restore it if needed.

If you ever want to start over, just re-run the copyfile command above to overwrite the modified database.

Now, let’s connect to the copied database. The parameters of the connect function depend on the RDBMS. For SQLite3 we need to specify the filename. After connecting, we create a cursor object and use its execute method to run SQL statements.

For SELECT queries, we can iterate over the cursor to get all rows (as tuples) in the result set:

Alternatively, we can use the fetchone, fetchmany or fetchall methods to read one, several, or all rows from the result set:

After each call to execute, the cursor is positioned before the first row of the result set. Executing another SQL statement resets it.

21.2 Prepared statements

To run SQL statements that use potentially unsafe data (e.g. data entered by users) we should always use prepared statements. A prepared statement is an SQL command that uses placeholders (? in sqlite3) that will be replaced by specified values in a secure way. (Note that other packages might use different placeholders, such as %s; see details in PEP 249.)

For example, if we want to find employees born between two given years entered by the user, we can use the following prepared statement:

SELECT * FROM employee WHERE year_of_birth BETWEEN ? AND ?

and specify a tuple with the values (that will replace the placeholders) as the second argument when calling the execute method. (Remember the trailing comma if there is only one value in the tuple):

21.3 SQL injection

Why is it so important to use prepared statements? Consider the following program:

ImportantDo not use f-strings or string formatting to prepare a SQL!

This program does exactly what we want – as long as we enter valid employee IDs. But what happens if you enter the following input?

0 UNION SELECT *, NULL, NULL, NULL, NULL, NULL FROM project

The query that gets executed is actually:

SELECT * FROM employee WHERE id = 0
UNION
SELECT *, NULL, NULL, NULL, NULL, NULL FROM project

returning the following rows:

(1, 'Project Alfa, Uppsala', 210000, None, None, None, None, None)
(2, 'Project Bravo, Knivsta', 190000, None, None, None, None, None)
(3, 'Project Charlie, Stockholm', 420000, None, None, None, None, None)

Short exercise. Why have we used NULLs in *, NULL, NULL, NULL, NULL, NULL in the query above?

This is an example of SQL injection: we were able to inject and execute our own SQL (selecting a list of projects) instead of the SQL in the program (retrieving information about a particular employee).

If executing multiple statements were enabled (and some client libraries or servers do allow this), an attacker could supply input like 0; DROP TABLE employee and cause additional statements to run — potentially damaging the data.

Always use prepared statements! Try the same experiment using this safe version:

21.4 Exceptions

The sqlite3 package supports the following exceptions:

  • Warning
  • Error
    • DatabaseError
      • IntegrityError
      • ProgrammingError
      • OperationalError
      • NotSupportedError

Execute the following cells and notice the exception types. (As an exercise, try to identify errors in the SQL before you execute each example.)

21.5 Closing the cursor and the connection

When we are finished executing SQL statements, we should close both the cursor and the connection:

21.6 Pandas with databases

There is also database support in Pandas. In the database-related functions we will need to specify the connection, and there are several alternatives:

  • An SQLite3 connection (as we have seen above).

  • A database URL in the form: rdbms://username:password@host:port/database.

    For example, mysql://scott:tiger@localhost/foo refers to the database foo stored on a MySQL server at localhost, using the username scott and the password tiger to authenticate.

    For SQLite3 databases we only need to specify the filename in the URL, e.g. sqlite:///data/sqltutorial_copy.db.

  • An SQLAlchemy connection (created by calling sqlalchemy.create_engine with the database URL as the argument).

To execute a query and get the result as a DataFrame we can use the read_sql_query function:

We can also store a data frame (or a series) in the database with the to_sql method: