Pandas: Drop NA rows based on shared column values
I have the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})
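I want to drop all rows that share the same Country and Year values when Count is NaN for every row in that group. In this case, the rows with index 0 and 1 should be dropped (note that row 5 should not be dropped).
Is it possible to do this with built-in pandas functions, without looping?
The code below achieves the desired result, but it is very inefficient (the real DataFrame is much larger):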
for country in df.Country.unique():
    for year in df.Year.unique():
        # Drop the whole (country, year) group if every Count in it is NaN
        if df[(df.Country==country) & (df.Year==year)].Count.isna().all():
            df.drop(df[(df.Country==country) & (df.Year==year)].index, inplace=True)
Is there a better, more efficient way?
You can use groupby and filter to keep only the groups where 'not every count is null'.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})
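# Keep a (Country, Year) group unless every Count in it is null.
# `not` is used instead of `~` so the lambda always returns a plain bool.
df.groupby(['Country','Year']).filter(lambda x: not x['Count'].isnull().all())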
Output:
  Country  Year  Category  Count
2       A  2021         1    1.0
3       A  2021         2    2.0
4       B  2020         1    3.0
5       B  2020         2    NaN
6       B  2021         1    5.0
7       B  2021         2    6.0
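Since the real DataFrame is said to be much larger, it may be worth noting that filter calls a Python function once per group. A possible alternative, sketched here as a suggestion rather than as part of the original answer, is to build a boolean mask with transform("any") and index with it. The names has_data and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6],
})

# For each row, True if its (Country, Year) group has at least one non-NaN Count.
has_data = (
    df["Count"].notna()
    .groupby([df["Country"], df["Year"]])
    .transform("any")
)

# Boolean indexing keeps the original index labels (rows 0 and 1 are dropped).
result = df[has_data]
print(result)
```

This should select the same rows as the filter version; whether it is actually faster depends on the number of groups, so it is worth timing on the real data.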