Replace NA in DataFrame for multiple columns with mean per country
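I would like to replace the NA values with the mean of the same column over the other years.
Note:
To replace the NA values in the Canada data, I want to use only Canada's mean, not the mean of the whole dataset.
Here is a sample DataFrame filled with random numbers, with some NA values placed the way I find them in my real DataFrame: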
Country | Inhabitants | Year | Area | Cats | Dogs |
---|---|---|---|---|---|
Canada | 38 000 000 | 2021 | 4 | 32 | 21 |
Canada | 37 000 000 | 2020 | 4 | NA | 21 |
Canada | 36 000 000 | 2019 | 3 | 32 | 21 |
Canada | NA | 2018 | 2 | 32 | 21 |
Canada | 34 000 000 | 2017 | NA | 32 | 21 |
Canada | 35 000 000 | 2016 | 3 | 32 | NA |
Brazil | 212 000 000 | 2021 | 5 | 32 | 21 |
Brazil | 211 000 000 | 2020 | 4 | NA | 21 |
Brazil | 210 000 000 | 2019 | NA | 32 | 21 |
Brazil | 209 000 000 | 2018 | 4 | 32 | 21 |
Brazil | NA | 2017 | 2 | 32 | 21 |
Brazil | 207 000 000 | 2016 | 4 | 32 | NA |
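What is the easiest way in pandas to replace those NA values with the mean of the other years? Is it possible to write code that goes through every NA and replaces them all at once (Inhabitants, Area, Cats, Dogs)?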
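Note: The example is based on the additional data source from the comments.
Replacing NA values in multiple columns with mean()
You can combine the following three methods: fillna(), groupby(), and transform().
Create a DataFrame from your example:
import pandas as pd
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')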
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
---|---|---|---|---|---|---|---|---|---|---|
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278 |
Canada | 2007 | 7.48175 | 10.7392 | nan | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681 |
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175 |
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633 |
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113 |
Call fillna() and pass it the per-country means, computed for every column by grouping on the country name:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check the result for Canada:
df[df['Country name'] == 'Canada']
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
---|---|---|---|---|---|---|---|---|---|---|
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278 |
Canada | 2007 | 7.48175 | 10.7392 | 0.93547 | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681 |
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175 |
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633 |
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113 |
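The same pattern can be tried on the sample data from the question. This is a minimal sketch, not a definitive implementation: the names `value_cols` and `country_means` are illustrative, the Inhabitants figures are entered as plain numbers, and the grouping is restricted to the four columns the question lists so that `Year` is left untouched.

```python
import numpy as np
import pandas as pd

# Sample data from the question (NA entered as np.nan).
df = pd.DataFrame({
    "Country": ["Canada"] * 6 + ["Brazil"] * 6,
    "Inhabitants": [38_000_000, 37_000_000, 36_000_000, np.nan, 34_000_000, 35_000_000,
                    212_000_000, 211_000_000, 210_000_000, 209_000_000, np.nan, 207_000_000],
    "Year": [2021, 2020, 2019, 2018, 2017, 2016] * 2,
    "Area": [4, 4, 3, 2, np.nan, 3, 5, 4, np.nan, 4, 2, 4],
    "Cats": [32, np.nan, 32, 32, 32, 32] * 2,
    "Dogs": [21, 21, 21, 21, 21, np.nan] * 2,
})

# Columns to fill (assumption: Year should never be imputed).
value_cols = ["Inhabitants", "Area", "Cats", "Dogs"]

# transform("mean") returns a frame with the same index as df; each row
# holds the mean of its own country, computed while skipping NaN values.
country_means = df.groupby("Country")[value_cols].transform("mean")

# fillna aligns on index and column labels, so only missing cells change.
df[value_cols] = df[value_cols].fillna(country_means)

print(df)
```

Because the means are computed per group, the missing Inhabitants value for Canada is filled from the Canadian rows only, and the Brazilian one from the Brazilian rows only.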
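This also works, although it fills with each column's overall mean rather than the per-country mean:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check the number of null values in each column of df
df.isnull().sum()
Out[3]: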
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
Solution
In [4]:
# Fill null values with each column's overall mean (numeric columns only)
df.fillna(df.mean(numeric_only=True), inplace=True)
In [5]:
# Second check for the number of null values
df.isnull().sum()
Out[5]: no more null values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
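Keep in mind that `df.mean()` is the mean over the whole dataset, which is what the question wanted to avoid. The two approaches can be combined, though: fill with the country mean first and fall back to the column-wide mean only where a country has no data at all in a column. The sketch below uses hypothetical toy data; the column names `country`, `score`, and `other` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country "B" has no observed value in "score".
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "score": [1.0, np.nan, 3.0, np.nan, np.nan],
    "other": [10.0, 20.0, np.nan, 40.0, np.nan],
})

num_cols = ["score", "other"]

# Column-wide means; numeric_only=True skips the string column.
global_means = df.mean(numeric_only=True)

# Per-country means, aligned to the original rows. A country whose
# values are all NaN in a column gets NaN as its mean.
group_means = df.groupby("country")[num_cols].transform("mean")

# Try the country mean first, then fall back to the column-wide mean.
filled = df.copy()
filled[num_cols] = df[num_cols].fillna(group_means).fillna(global_means)

print(filled)
```

Whether the fallback is appropriate depends on the data; leaving such cells as NaN may be the more honest choice when a country has no observations at all.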