15 Grouping and aggregating

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

15.1 Grouping

During data analysis, we often need to group data based on the values of one or more columns and compute summary statistics for each of these groups.

Let’s explore a few examples of grouping and aggregating using Seaborn’s tips dataset:

For example, how many tips were left by customers on each day? We can create a separate group for each day and then count the number of rows in each group:

The summary statistics methods that we previously used with Series and DataFrames can also be applied to summarize each group:

If we want to compute several summary statistics at once, we can use the .agg method (short for aggregate):

Note that the resulting columns form a hierarchical (multi-level) column index. To access a specific value column and its statistic, we use a tuple containing both levels of the column name:

We can also use the .agg method to specify different summary statistics for different columns:

Another way to use the .agg method is by using named arguments. The argument name serves as the column name in the resulting DataFrame, while the value is a tuple that specifies the source column and the aggregation function:

We can even aggregate data using our own custom summary functions with the .apply method:

This calculates the overall tipping ratio (total tip divided by total bill) for each day.

To group data by multiple columns, you can pass a list of their names to .groupby. In the resulting DataFrame, the row labels form a hierarchy (a MultiIndex).

Sometimes, we might need to group by the index. If the index has no name, we can use df.groupby(df.index) or df.groupby(level=0).

We can also create pivot tables with .pivot_table, which provide a convenient way to summarize data by multiple dimensions:

Exercise. Read the weather data from the data/weather.csv file and display a table showing the daily minimum, maximum, and mean values for both temperature and humidity.

Hint: If your DataFrame uses a DatetimeIndex and you want to group by date, you can use df.groupby(df.index.date). (Unlike for datetime columns, there is no .dt).

15.2 Most useful aggregate functions

size: Number of rows in each group
count: Number of non-null values in each group
nunique: Number of unique values in each group
first: First non-null value in each group
head(n=5): First n rows in each group
last: Last non-null value in each group
tail(n=5): Last n rows in each group
min: Minimum value in each group
idxmin: Index/label of the minimum value in each group
nsmallest(n=5, keep='first'): n smallest values in each group
max: Maximum value in each group
idxmax: Index/label of the maximum value in each group
nlargest(n=5, keep='first'): n largest values in each group
sum: Sum of the values in each group
prod: Product of the values in each group
mean: Mean of the values in each group
median: Median of the values in each group
std: Standard deviation of the values in each group
var: Variance of the values in each group
describe: Descriptive statistics