Pandas: Drop NA rows based on shared column values

Question

I have the following DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})

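I want to drop every group of rows that share the same Country and Year values when all rows in that group have NaN in the Count column. In this case, the rows with index 0 and 1 should be dropped (note that row 5 should not be dropped, because the other row in its group has a valid Count).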

Can this be done with built-in pandas functions, without looping?

The code below produces the desired result, but it is very inefficient (the real DataFrame is much larger):

for country in df.Country.unique():
    for year in df.Year.unique():
        if df[(df.Country==country) & (df.Year==year)].Count.isna().all(): 
            df.drop(df[(df.Country==country) & (df.Year==year)].index, inplace=True)

Is there a better, more efficient way?

Answer 1

You can use groupby and filter to keep only the groups where not every Count is null:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})

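# Keep a (Country, Year) group if at least one of its Count values is not NaN.
# notna().any() avoids applying ~ to a scalar, which misbehaves on a plain Python bool.
df.groupby(['Country', 'Year']).filter(lambda x: x['Count'].notna().any())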

Output

  Country  Year  Category  Count
2       A  2021         1    1.0
3       A  2021         2    2.0
4       B  2020         1    3.0
5       B  2020         2    NaN
6       B  2021         1    5.0
7       B  2021         2    6.0
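
As a possible alternative (not part of the answer above), the same selection can be sketched with a boolean mask built by groupby and transform instead of a per-group lambda. This relies on the built-in "any" aggregation, which is often faster on large frames, though that should be measured on the real data. The names keep and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6],
})

# For each row, True if its (Country, Year) group contains at least one
# non-NaN Count. transform broadcasts the per-group result back to the
# original index, so the mask lines up with df row by row.
keep = df["Count"].notna().groupby([df["Country"], df["Year"]]).transform("any")

# Boolean indexing returns a new frame; df itself is left unchanged.
result = df[keep]
print(result)
```

Rows 0 and 1 are excluded because their whole group is NaN, while row 5 is kept because row 4 in the same group has a valid Count.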
