13 Combining and merging dataframes
Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.
13.1 Concatenating dataframes
To concatenate rows from multiple dataframes, we can use the .concat method:
Exercise. Try changing the column names of df2 and observe the result of concatenation.
We can also concatenate columns by setting axis=1:
pd.concat([df1, df2], axis=1)13.2 Combining dataframes
Consider the following example. We have two dataframes: one contains height information and the other contains weight information:
Note that person D is missing from height_df and person C is missing from weight_df.
To combine these dataframes based on the row labels, we can use the merge function:
The first parameter of merge is called the left frame, the second the right frame. We match rows by comparing labels of the left frame (left_index=True) with labels of the right frame (right_index=True). By default, only rows present in both dataframes are included (this is called an inner join).
If we want to include all rows from the left frame (height_df), even if there are no matching rows in the right frame, we can use a left join. Missing values from the right frame will appear as NaN, which stands for Not a Number:
Similarly, to include all rows from the right frame, even if there are no matching rows in the left frame, we can use a right join:
To include unmatched rows from both frames, we use a (full) outer join:
13.2.1 Matching on columns instead of indices
If we want to use values in a specific column or columns for matching, we use left_on and right_on, respectively:
To merge on multiple columns, pass a list of column names to left_on / right_on.
If the column names in both dataframes are identical, we can use on instead of specifying left_on and right_on.
Note that when multiple columns share the same name in both frames, Pandas automatically adds suffixes _x and _y to distinguish them.
If we omit the left_index, left_on, right_index, right_on and on parameters, Pandas will merge on all common column names.
Exercise. The file data/weather_precipitation.csv contains hourly precipitation measurements for the same period and location as weather_df. However, it only includes hours with precipitation. Load the data into a separate dataframe, and merge it with weather_df to consolidate both weather and precipitation information.
Sample solution