How to fill missing values in a DataFrame based on group value counts?
I have a pandas DataFrame with 2 columns: year (int) and condition (string). The condition column contains a NaN value, and I want to replace it based on information from a groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
This gives:
print(X)
   year  condition
0  2015       good
1  2016       good
2  2017  excellent
3  2016       good
4  2016  excellent
5  2017  excellent
6  2015        NaN
7  2016       good
8  2015  excellent
9  2015       good
print(stat)
year  condition
2015  good         2
      excellent    1
2016  good         3
      excellent    1
2017  excellent    2
The NaN in row 6 has year = 2015, and stat shows that the most common condition for 2015 is 'good', so I want to replace this NaN with 'good'.
I tried using fillna together with .transform, but it did not work :(
Any help would be appreciated.
With a couple of extra transformations, stat can be turned into a dictionary that maps each year to its most frequent condition:
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
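As a possible alternative, the same mapping can be sketched directly from X without going through stat. The helper name most_frequent is hypothetical, and the tie rule (alphabetically first value wins, because Series.mode returns sorted values) is an assumption of this sketch rather than something stated in the question.

```python
import numpy as np
import pandas as pd

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", "excellent", "excellent",
        np.nan, "good", "excellent", "good"]
X = pd.DataFrame({"year": year, "condition": cond})


def most_frequent(s):
    """Hypothetical helper: most common non-null value of a Series, or NaN."""
    modes = s.mode()  # NaN is ignored by default; values come back sorted
    return modes.iloc[0] if not modes.empty else np.nan


# One most-frequent condition per year, collected into a plain dict.
fill_dict = X.groupby("year")["condition"].agg(most_frequent).to_dict()
print(fill_dict)  # {2015: 'good', 2016: 'good', 2017: 'excellent'}
```

A year whose conditions are all missing would map to NaN here, so fillna would leave those rows unchanged.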
Then combine fillna with map, using this dictionary:
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
   year  condition
0  2015       good
1  2016       good
2  2017  excellent
3  2016       good
4  2016  excellent
5  2017  excellent
6  2015       good
7  2016       good
8  2015  excellent
9  2015       good
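Since the question mentions trying fillna with .transform, here is a rough sketch of how that combination might be written. It is not necessarily what the asker attempted; the lambda and the variable name filled are illustrative assumptions, and it assumes every year has at least one non-null condition.

```python
import numpy as np
import pandas as pd

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", "excellent", "excellent",
        np.nan, "good", "excellent", "good"]
X = pd.DataFrame({"year": year, "condition": cond})

# transform broadcasts each year's most frequent condition back to the
# original row positions, so the result aligns with X['condition'].
# Assumption: no year is entirely NaN (mode() would be empty in that case).
per_row_mode = X.groupby("year")["condition"].transform(
    lambda s: s.mode().iloc[0]
)

# Only the missing entries are replaced; existing values are kept.
filled = X["condition"].fillna(per_row_mode)
print(filled.tolist())
```

This avoids the intermediate dictionary, at the cost of computing the mode through a Python-level function for each group.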