14 Handling duplicates and missing data

Warning

Any changes you make to the code on this page, including your solutions to exercises, are temporary. If you leave or reload the page, all changes will be lost. If you would like to keep your work, copy and paste your code into a separate file or editor where it can be saved permanently.

14.1 Handling duplicates

Sometimes, a DataFrame may contain rows that are repeated. For example:

We can remove duplicate rows using the .drop_duplicates method:

We often want to apply this operation in place, using inplace=True. If we also want to reset the index after removing duplicates, we can use ignore_index=True.

By default, duplicates are detected by comparing values in all columns. If we only want to consider certain columns, we can specify them using the subset parameter.

Consider the following DataFrame, which contains both duplicated entries and inconsistent information about Alice’s city:

If we use .drop_duplicates() without specifying any subset, Pandas will keep both rows for Alice because they differ in the city column.

Since the id column uniquely identifies each person, we can remove duplicates based only on that column:

By default, Pandas keeps the first occurrence of each duplicate. If we prefer to keep the last occurrence, we can set keep='last'.

Tip: Before removing duplicates, you can check which rows are duplicated:

The .duplicated method supports both the subset and keep parameters.

14.2 Handling missing data

Missing values are common in real-world datasets. In Pandas, they are represented using NaN (np.nan), a special floating-point value that stands for “not a number”.

Let’s create an example DataFrame with some missing values:

Note that even if a column otherwise contains only integers, the presence of NaN causes Pandas to store it as a float. (In the newer Pandas versions, you can consider using the Int64 dtype for nullable integers.)

You might think we can check for missing values with == np.nan, but this doesn’t work. NaN is never equal to anything, not even itself (this is part of the IEEE 754 floating-point standard):

To check for missing values, we can use .isna (or its alias .isnull):

To check for non-missing values, we can use .notna (or .notnull).

Real world datasets sometimes use placeholders (like “NA”, “?”, “-”) for missing data. You can tell Pandas to interpret these as missing values when reading data, e.g.

pd.read_csv("data.csv", na_values=["NA", "?", "-"])

14.2.1 Approaches to handle missing data

How we handle missing data depends on the situation. Common approaches include:

Drop rows with missing values:
Fill missing values with a specified value:
Fill missing values with a column statistic, like the mean of non-missing values:
Interpolate missing data:

The interpolate method supports various interpolation techniques; see the documentation for details.

All these methods support inplace=True for modifying the DataFrame directly.

Tip:

Use .dropna when missing values are rare and random.
Use .fillna when a meaningful constant or summary statistic makes sense.
Use .interpolate when the data varies smoothly over time.

Exercise. Import the dataset data/weather_unclean.csv. Use the .describe and .info methods to check for missing data or other anomalies.

Plot temperature and humidity over time, and then plot histograms for both variables (use .plot.hist() on a Series, and specify the number of bins with the bins parameter).

Some temperature measurements have a value of -50, indicating sensor failures.

Your task is to handle these missing or erroneous values in a reasonable way.