19  Data Query Language

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

In this chapter we introduce the SELECT statement, or more formally, a part of SQL known as Data Query Language (DQL). We will start with simple SELECT statements, and gradually introduce more features. All examples will use a sample database with the following tables:

Table department:

Table employee:

19.1 Queries involving one table

We start with queries using one table:

SELECT column [, ...]
FROM table
[WHERE condition]

In this chapter, [] denotes optional parts, ... is used to denote repetitions (like multiple columns separated by commas) or parts of queries not relevant when introducing specific SQL syntax, and | separates possible choices, indicating that one of the options should be selected.

SQL generally ignores extra whitespace (like new lines or multiple spaces) outside of string literals. Keywords are case-insensitive, so SELECT is the same as select. Some database systems might require a semicolon at the end of a query.

The condition in the WHERE clause is a boolean expression evaluated for every row. If true, the row is included; if false, it is skipped. No WHERE clause means all rows are included.

We also need to specify which columns to include in the result. To include all columns, we can use *.

Note that to check for equality we use = (not == as in Python or other programming languages).

We can also use expressions with columns:

To rename a column in the result, we can use AS followed by the new name:

19.1.1 DISTINCT

The following query finds IDs of departments that have employees with hourly salary over 230:

In this case, we might not want to include the same department ID more than once.

To remove duplicates, we can use SELECT DISTINCT:

19.1.2 Conditions

We can combine conditions using AND, OR and NOT:

As mentioned above, to check for equality we use =. For inequality, we use <>. Some database systems also accept !=, but this is not following the SQL standard and may reduce portability of your queries across different systems.

Other useful operators include:

  • IS [NOT] NULL
    True if the operand is [not] missing, e.g., hourly_salary IS NULL

    Never use = NULL nor <> NULL!!!

    CautionWrong code
    SELECT *
    FROM employee
    WHERE hourly_salary = NULL
  • [NOT] BETWEEN ... AND ...
    True if the operand is [not] in a range of values (both bounds are inclusive), e.g. hourly_salary BETWEEN 100 AND 200

  • [NOT] IN (...)
    True if the operand is [not] in a set, e.g. department_id IN (1, 2, 3)

  • [NOT] LIKE pattern
    True if the operand does [not] match a pattern. The pattern is a text string including two special characters (wildcards):

    • _ representing any character
    • % representing any sequence of characters (any substring)

    For example, last_name LIKE 'A%' matches all rows where the last_name starts with A.

    Do not use LIKE for equality checks.

19.1.3 Ordering

We can order rows in the result with ORDER BY:

SELECT ...
FROM ...
[WHERE ...]
[ORDER BY column [ASC | DESC] [, ...]]

ASC stands for ascending (the default) and DESC for descending. In addition to a column name, we can also order by a column alias or a column expression.

Examples:

19.2 Queries involving several tables

We can combine rows from multiple tables with JOIN:

SELECT ...
FROM table_1
JOIN table_2 ON join_condition
[WHERE ...]
[ORDER BY ...]

The join_condition specifies which rows from the first table combine with which rows from the second one.

You can think about this as combining every row from the first table with every row from the second table and only considering those combinations where join_condition is true.

Very often, these conditions express the same relationships as those defined by foreign keys. Even if the database already enforces these relationships, we still must specify the conditions explicitly.

For example, the employee’s department_id is referencing the id column in the department table. To get information about the employees together with the title of the department, we can run:

Note that we can use employee.* to include all columns from the employee table.

Since the columns are coming from several tables, we might need to use both the table name and the column name, separated by ., to specify a column, e.g. department.id. Including the table name is optional if the column name is not ambiguous (that is, if the column name exists only in one of the tables). Still, it is a good practice to include table names, as it improves the readability.

We can specify aliases for the tables:

If we join a table with itself (for example to get information about both employees and their supervisors), aliases are necessary.

Conditions for inclusion still go in the WHERE clause:

A few final notes about the joins:

  • We can use multiple JOIN clauses to join more than two tables.

  • Old-style “comma” joins specifying all tables in FROM and including join conditions in WHERE are equivalent, but error-prone (beginners tend to forget to include the join conditions):

19.3 Set operators

We can use UNION, INTERSECT and EXCEPT (or MINUS in some systems) to calculate the union, intersection and set difference between the result sets of multiple queries. All queries must have the same number of columns.

The most useful (and commonly used) of these operators is UNION, which combines rows from multiple queries:

SELECT ...
UNION [ALL]
SELECT ...

Without ALL the duplicates are removed.

Distinct rows common to all queries can be obtained with INTERSECT:

SELECT ...
INTERSECT
SELECT ...

Distinct rows in the result of the first query that are not included in the result of the second query can be obtained with EXCEPT (or MINUS):

SELECT ...
EXCEPT
SELECT ...

Unlike UNION, the INTERSECT and EXCEPT are not used often. Some database systems might not even support these operations. We can use nested queries (covered below) to obtain the same result.

19.4 Aggregate functions

We can calculate summary statistics with aggregate functions such as count, sum, avg, min, max. The column name (or a column expression) is specified as the argument: For example:

The count function can also use * as the parameter to count the number of rows:

Note for beginners: Using an aggregate function in a query means the result will not include individual rows from the table, but only a single row with the summary statistics. That also means that all “columns” specified after SELECT must be aggregates:

Aggregate functions (with exception of count(*)) ignore NULL values. Consider the following query:

There are 17 rows in the table, but only 13 non-null values in the supervisor_id column, and only 4 distinct (non-null) values.

19.5 Grouping

We can divide rows (that meet the WHERE condition) into groups based on the value in one or more columns and calculate summary statistics for each group with a GROUP BY clause:

SELECT ...
FROM ...
[JOIN ... ON ...]
[WHERE ...]
[GROUP BY column [, ...]]
[ORDER BY ...]

Columns in the result (after SELECT) must be either:

  • group statistics (aggregates), or
  • grouping columns (or expressions using them) and other columns (or expressions based on them) that are functionally dependent on these

For example, to find the number of employees and the average salary in each department:

Some database systems may allow us to include additional columns in the result. However, if we select columns that are not part of the grouping or an aggregate function, the database will return one of the (potentially many different) values from each group. The specific value returned depends on the database system’s internal processing and should not be relied upon. This behavior can lead to unexpected results and is a common source of errors.

19.6 Filtering groups

We can filter groups using HAVING:

SELECT ...
FROM ...
[JOIN ... ON ...]
[WHERE ...]
[GROUP BY ...]
[HAVING condition]
[ORDER BY ...]

For example, to find the maximum hourly salary in departments with at least 5 employees we can use the following query:

Question: Why cannot we use count(*) >= 5 in the WHERE clause instead?

19.7 Outer joins

An inner join like

SELECT *
FROM A
JOIN B ON B.attr_b = A.attr_a

returns only those combinations of rows from A and B that satisfy the join condition.

An outer left join

SELECT *
FROM A
LEFT JOIN B ON B.attr_b = A.attr_a

includes all rows in A, even if there is no matching row in B (using NULL for all columns in B).

Similarly, an outer right join

SELECT *
FROM A
RIGHT JOIN B ON B.attr_b = A.attr_a

includes all rows in B, even if there is no matching row in A (using NULL for all columns in A).

There is also a full outer join, though it is rarely used, and many database systems do not support it.

Question: Some database systems do not support RIGHT JOIN. What can we do instead?

19.8 Nested queries

Let us start with a motivating example and try to find employees with the lowest hourly salary. Consider this query:

Does this work? It will only show one row in the result set (due to LIMIT 1). If multiple employees have the same lowest salary, we will see only one of them.

We can divide the problem into two steps. First, find the lowest salary:

The result is 170. Now, we can find all employees with this salary:

This works for the current state of the database, but if another employee with a lower salary is added later, the query will not return the correct result. Fortunately, we can use a nested query (returning a single value) to dynamically find the intended value (instead of “hardcoding” 170):

Similarly, we can use a nested query returning values in a single column to be used with IN. For example, to select departments with employees whose last name starts with K.

19.8.1 Correlated queries

A nested query can also use columns from the tables in the outer query. Such queries are called correlated or synchronized, for example:

This might be difficult to grasp at first. Unlike previous examples, the nested query cannot produce a single fixed result without reference to the outer row. The WHERE condition (in the outer query) is used to determine if a given row from table e is included in the result. For example, consider this row:

id last_name first_name year_of_birth department_id hourly_salary supervisor_id note
1 Alnes Bernt 1967 1 200 2

The condition for inclusion is:

hourly_salary > (
    SELECT avg(hourly_salary)
    FROM employee
    WHERE department_id=e.department_id
)

For this row, e.department is 1, so the condition becomes:

hourly_salary > (
    SELECT avg(hourly_salary)
    FROM employee
    WHERE department_id=1
)

The nested query calculates the average salary in department 1 (208.75). Since 200 is not greater than this value, the row is not included in the result.

19.8.2 Useful operators involving nested queries

  • (=|<>|<|<=|>|>=) ANY (nested_query)

    True if the operand satisfies the comparison with any of the values returned by the nested query.

    Note that expr = ANY (nested_query) is equivalent to expr IN (nested_query).

  • (=|<>|<|<=|>|>=) ALL (nested_query)

    True if the operand satisfies the comparison with all values returned by the nested query.

    Note that expr <> ALL (nested_query) is equivalent to expr NOT IN (nested_query).

  • EXISTS (nested_query)

    True if the nested query returns at least one row.

  • NOT EXISTS (nested_query)

    True if the nested query returns no rows.

For example, to select the worst paid employees in each department we can use the following query:

19.9 Common Table Expressions

A common table expression (CTE) allows us to use the result of a query as if it were a table in another query:

WITH cte_name AS (SELECT ...)
SELECT ...
FROM cte_name
...

The CTE exists only during the execution of that query. (There is also a related concept called views, which allow a query result to behave like a table for any query.)

CTEs can often be used instead of nested queries to make queries easier to read and understand.

19.9.1 Ordinary CTE

For example, to select the minimum average hourly salary across all departments, we can create a CTE named avg_salaries that calculates the average salary for each department, and then use it in the main query to find the minimum value:

19.9.2 Recursive CTE

CTEs can also be recursive, allowing the CTE to refer to itself in its definition:

WITH RECURSIVE cte_name AS (
    SELECT ... -- initial SELECT
    UNION ALL
    SELECT ... -- recursive SELECT (can reference "cte_name")
)
SELECT ...

The result of a recursive CTE is generated iteratively:

  1. The initial SELECT creates the first batch of rows.
  2. The recursive SELECT runs repeatedly to generate new batches of rows to be added to the result, with cte_name representing the last batch rows added in the previous iteration (not all rows generated so far).
  3. The recursion stops when the recursive SELECT produces no new rows.

The following example shows how to generate numbers from 1 to 5 using a recursive CTE:

The initial SELECT generates the first row. The recursive SELECT uses the last added row (the cte represents the rows added in the last iteration) to add a new row with the last value plus 1. This is repeated until the condition n < 5 is no longer met.

Remember to always include a condition in the recursive SELECT so that recursion terminates.

Recursive CTEs are especially useful for working with hierarchical data, such as organizational charts or category trees. For example, to select all employees and their level in the organizational hierarchy: