11 Pandas data structures

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

In this chapter, we start working with Pandas, a powerful Python library for working with tabular data. Pandas is built around two main data structures: Series and DataFrame.

Before we can use the library, we need to import it. We also import NumPy and Matplotlib Pyplot:

Note that all these libraries have established alias conventions (Pandas is using pd, NumPy np, and Matplotlib Pyplot plt).

11.1 Series

A Series is a one-dimensional array of elements with labels. Unlike lists, all elements in a Series must have the same datatype, referred to as the dtype of the Series. The labels, also known as the index, make accessing and manipulating the data very convenient.

The following example shows how to define a new Series:

If no index is given, Pandas will create one automatically using integers starting from 0:

We can use both label and position to access individual elements and slices, and therefore need a way to distinguish between these two.

We use .loc[] (label based indexing) to access elements by labels:

When specifying a slice with labels, both the start and stop labels are included:

We use .iloc[] (integer-location based indexing) to access by position. This works like list indexing with [], where selection is based on integer position.

Series are mutable:

11.2 DataFrame

A DataFrame is a two-dimensional table with labeled rows and columns. It is similar to a spreadsheet or a database table. Different columns may have different data types (dtypes).

To print a summary of a DataFrame (information about its index, columns, data types, counts, and memory usage) we can use the .info method:

We can use the .shape attribute to get a tuple with the number of rows and columns:

Methods and attributes are both associated with objects, such as a DataFrame. Methods are actions you can perform on an object (like functions) and require parentheses (), whereas attributes are pieces of information stored within the object (like variables) and do not use parentheses.

We can use the .index and .columns attributes to retrieve the row labels and column names:

11.3 Selecting data

To select a specific column, use the [] operator with the column name in brackets:

Note that the result is a Series.

We can also access the column as an attribute of the dataframe, as long as its name is a valid attribute name and does not clash with any existing attributes or methods:

To select multiple columns, we can simply provide their names as a list within the brackets:

Note that if we use a list of column names, the result is a DataFrame.

To select a specific row or rows we can use the .loc[] and .iloc[] indexers:

We can easily retrieve the first or last rows of the dataframe using the .head and .tail methods. We can specify the desired number of rows to return as the argument (if not specified, both these functions return five rows):

The .loc[] operator can also use a boolean array to specify whether or not each row should be included in the selection. This allows us to filter rows based on a condition:

The expression inside the brackets evaluates to a boolean Series. Each value represents whether the corresponding row will be included or excluded from the result.

When combining conditions, use the operators &, |, ~ instead of and, or, not. Each condition must be wrapped in parentheses, for example:

An alternative approach for filtering rows is by using the .query method:

We have already seen how we can use the .loc[] and .iloc[] indexers to select rows. However, these indexers can also be used to access individual cells or other subsets of the dataframe. To do this, we specify the rows(s) and column(s) inside the brackets:

df.loc[row_specification, column_specification]

The row_specification can be a single label, a list of labels, a label-based slice, or a boolean array. Similarly, we can specify columns in various ways, including a single column name, a list of names, or a name-based slice. Let’s explore a few examples to understand this better:

To access a single cell, we can also use the .at[] and .iat[] indexers:

Similar to series, dataframes are mutable:

11.4 Textual, categorical, and date/time data

In addition to numeric data types, there are a few other useful data types worth mentioning.

Let’s start with textual data. By default, Pandas uses the object data type to store a series of text strings. Most of the time it is fine to leave it as is, but we can explicitly convert it to string using the .astype method:

When dealing with categorical (qualitative) variables, it is recommended to use the category data type:

This might be useful for certain scenarios, such as a regression analysis, where we need to substitute categorical variables with dummy/indicator variables:

Finally, let’s mention the datetime data type. In the following example, we convert strings containing dates and times into Python datetime objects:

Using datetime columns simplifies accessing various date and time related properties:

Exercise. Pandas can read and write data stored in various file formats such as CSV, Excel, JSON, HDF5, Parquet, Stata, SAS and SPSS. (You can find more information at https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html.)

In the cell below we are using the read_csv function to read a CSV file named data/weather.csv. This file contains hourly observations of temperature, humidity and air pressure for May 2021 in Uppsala, Sweden.

This function has a few useful parameters we can use. The delimiter parameter allows us to specify the character used as the field delimiter (by default, it is a comma).

The parse_dates parameter allows us to specify a list of columns to be parsed and stored as datetime columns. Using ['date', 'time'] as an element of this list will make Pandas merge these two columns in the file into a single datetime column named date_time in the resulting DataFrame.

Lastly, the index_col parameter allows us to specify the column in the file that will be used as the row labels, i.e., the index of the DataFrame.

Examine the first few rows of the weather_df dataframe and the output of the .info method. Then answer the following questions:

What were the temperature, humidity and air pressure on May 3 at 16:00?
What were the temperature and humidity for each hour on May 17?
How many temperature measurements are below 0?