15  Grouping and aggregating

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

15.1 Grouping

During data analysis, we often need to group data based on the values of one or more columns and compute summary statistics for each of these groups.

Let’s explore a few examples of grouping and aggregating using Seaborn’s tips dataset:

For example, how many tips were left by customers on each day? We can create a separate group for each day and then count the number of rows in each group:

The summary statistics methods that we previously used with Series and DataFrames can also be applied to summarize each group:

If we want to compute several summary statistics at once, we can use the .agg method (short for aggregate):

Note that the resulting columns form a hierarchical (multi-level) column index. To access a specific value column and its statistic, we use a tuple containing both levels of the column name:

We can also use the .agg method to specify different summary statistics for different columns:

Another way to use the .agg method is by using named arguments. The argument name serves as the column name in the resulting DataFrame, while the value is a tuple that specifies the source column and the aggregation function:

We can even aggregate data using our own custom summary functions with the .apply method:

This calculates the overall tipping ratio (total tip divided by total bill) for each day.

To group data by multiple columns, you can pass a list of their names to .groupby. In the resulting DataFrame, the row labels form a hierarchy (a MultiIndex).

Sometimes, we might need to group by the index. If the index has no name, we can use df.groupby(df.index) or df.groupby(level=0).

We can also create pivot tables with .pivot_table, which provide a convenient way to summarize data by multiple dimensions:

Exercise. Read the weather data from the data/weather.csv file and display a table showing the daily minimum, maximum, and mean values for both temperature and humidity.

Hint: If your DataFrame uses a DatetimeIndex and you want to group by date, you can use df.groupby(df.index.date). (Unlike for datetime columns, there is no .dt).

15.2 Most useful aggregate functions

  • size: Number of rows in each group
  • count: Number of non-null values in each group
  • nunique: Number of unique values in each group
  • first: First non-null value in each group
  • head(n=5): First n rows in each group
  • last: Last non-null value in each group
  • tail(n=5): Last n rows in each group
  • min: Minimum value in each group
  • idxmin: Index/label of the minimum value in each group
  • nsmallest(n=5, keep='first'): n smallest values in each group
  • max: Maximum value in each group
  • idxmax: Index/label of the maximum value in each group
  • nlargest(n=5, keep='first'): n largest values in each group
  • sum: Sum of the values in each group
  • prod: Product of the values in each group
  • mean: Mean of the values in each group
  • median: Median of the values in each group
  • std: Standard deviation of the values in each group
  • var: Variance of the values in each group
  • describe: Descriptive statistics