12 Basic operations with dataframes

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

In this chapter, we cover common operations on Pandas DataFrames, including column expressions, sorting, indexing, adding or dropping rows and columns, and summarizing data.

12.1 Expressions using columns

We can perform calculations or logical comparisons on entire columns:

These operations are vectorized, meaning they are applied to all rows without the need for explicit loops.

12.2 Sorting rows

We can sort rows by values in a specific column using .sort_values:

To sort in descending order:

Many DataFrame methods, including .sort_values, support two models:

Chaining (default): Returns a new DataFrame without modifying the original (df). This enables operations to be chained (using the output of one operation as the input for the next):
```
weather_df.query("temperature < 0").sort_values("temperature")
```
Inplace operation: Modifies the original DataFrame (df) and does not return anything. To perform the operation inplace, use the inplace=True parameter.

We can also sort rows by their labels using .sort_index:

12.3 Resetting and setting the index

We can reset the index, converting row labels to a column and switching to integer labels starting from 0, with .reset_index:

To set an existing column as the index, we use .set_index:

Exercise. Sort the weather data by temperature in descending order.

12.4 Adding, dropping and renaming columns

We can add columns in-place:

To create a new DataFrame with additional columns (without modifying the original), we can use .assign:

We can remove columns using .drop:

To rename columns, we use .rename and pass a dictionary that maps the old names (keys) to the new names (values) using the columns parameter:

Exercise. Extend the weather data with a new column for the dew point, \(T_\mathrm{dp}\), using the following approximation:

\[ \begin{aligned} \gamma &= \ln ({H / 100}) + {\frac {bT}{c+T}}, \\ T_\mathrm{dp} &= {\frac {c\gamma}{b-\gamma}}, \end{aligned} \]

where \(H\) is the relative humidity, \(T\) is the temperature in °C, \(b=18.678\), and \(c=257.14\) °C.

12.5 Removing and adding rows

We can remove rows using .drop:

Multiple rows can be removed simultaneously by specifying the labels as as a list, e.g. df_sorted.drop(["Alice", "Daniel"]).

We can add new rows with unique labels using .loc[]:

The row values could also be specified as a Series.

To add multiple rows, or rows with duplicate labels, we can use the .concat method (described in the next chapter).

12.6 Summarizing data

We can obtain descriptive statistics for numeric columns using .describe:

If we want to include all columns, we can use:

To calculate the pairwise correlation among numeric columns, we can use the .corr method:

To obtain the counts of unique values within a specific column, we can use .value_counts:

If we need the number of unique elements in a column, we can use .nunique:

We can compute aggregate statistics using methods such as .count, .sum, .prod, .min, .max, .median, .mean, .var, .std, and .quantile. For example:

To find the label corresponding to (the first occurrence of) the minimum or the maximum value, we can use .idxmin and .idxmax:

We can also apply summary methods to multiple columns simultaneously:

Exercise. Using the weather data introduced above, examine the output of .describe(). Then, answer the following questions:

What were the minimum and maximum temperatures for the entire month?
What were the minimum and maximum temperatures on May 17.
Plot a graph depicting the temperature on May 17 (hint: obtain a series with the relevant temperatures and use the .plot method).