Replacing outliers with NaN in a pandas DataFrame after applying .groupby()
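I want to remove outliers from a pandas DataFrame using the standard deviation of the column variables after applying a groupby function.
Here is my DataFrame: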
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
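I want to first group the DataFrame by 'Speaker' and then remove the 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as lying more than 3 standard deviations from the mean of that particular feature's score.
Please let me know if this is possible. Thanks!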
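The only dependency this approach needs is pandas.
Assume we have already replaced the NaN values in the 'Speaker' column with a representative value, for example 'CommitteeOrganization':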
speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker
So our data looks like this:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275
2 19.570911 27.362067 17.253580 JanetYellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group by 'Speaker' using the pandas function:
datasetGrouped = dataset.groupby(by='Speaker').mean()
So our data looks like this:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
JanetYellen 19.570911 27.362067 17.253580 -0.604757
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798
LaelBrainard 18.798471 26.657992 16.858818 -0.001191
StanleyFischer 17.766196 33.889406 16.157089 0.582183
WilliamDudley 8.276460 64.478573 9.034156 1.670275
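To make the two steps above reproducible, here is a minimal sketch that fills the missing speakers and computes the per-speaker means. It uses only a handful of rows copied from the question, and the variable name `grouped_means` is an illustrative choice rather than something from the original answer.

```python
import pandas as pd

# A subset of the rows from the question, kept short for brevity.
dataset = pd.DataFrame({
    "ARI": [-2.09, 20.643483, 16.953460, 18.466305, 17.066088],
    "Flesch": [121.22, 17.069411, 36.246573, 35.186261, 32.592551],
    "Kincaid": [-3.4, 18.394178, 15.323457, 16.205693, 16.108486],
    "Speaker": [None, "Lael Brainard", "Lael Brainard",
                "Stanley Fischer", "Stanley Fischer"],
})

# By default groupby drops rows whose key is missing, so give them a label.
dataset["Speaker"] = dataset["Speaker"].fillna("CommitteeOrganization")

# One row per speaker; each numeric column is averaged within the group.
grouped_means = dataset.groupby("Speaker").mean()
print(grouped_means)
```

The two Lael Brainard rows collapse into a single row whose ARI is the average of 20.643483 and 16.953460, which agrees with the grouped table shown above.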
Compute the standard deviation of each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
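Replace the values in the rows that meet the condition with NaN:
# Note: each condition compares the absolute value itself (not its distance from the mean) with 3 standard deviations.
# float('nan') is a real missing value; the string 'NaN' would be stored as text instead.
datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = float('nan')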
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
JanetYellen 19.570911 27.3621 17.253580 -0.604757
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798
LaelBrainard 18.798471 26.658 16.858818 -0.001191
StanleyFischer 17.766196 33.8894 16.157089 0.582183
WilliamDudley 8.276460 64.4786 9.034156 1.670275
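The condition used above compares each raw value with three standard deviations. If an outlier is instead meant as a value more than three standard deviations away from the column mean, as the question describes, one possible variant is sketched below. The data and the names `grouped`, `cols`, `z`, and `cleaned` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical per-speaker means; the last ARI value is deliberately extreme.
grouped = pd.DataFrame(
    {"ARI": [10.0] * 11 + [500.0], "Flesch": [50.0 + i for i in range(12)]},
    index=[f"speaker_{i}" for i in range(12)],
)
cols = ["ARI", "Flesch"]

# z-score: distance from the column mean in units of the column's sample std.
z = (grouped[cols] - grouped[cols].mean()) / grouped[cols].std()

# mask() replaces entries where the condition is True with NaN,
# so the columns stay numeric.
cleaned = grouped.copy()
cleaned[cols] = grouped[cols].mask(z.abs() > 3)
print(cleaned)
```

With a sample standard deviation, one value among n can be at most (n - 1) / sqrt(n) standard deviations from the mean. That is about 2.04 for the six speakers shown here, so a mean-centered threshold of 3 would flag nothing in that small table; the sketch uses twelve rows for that reason.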
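The full code is available on GitHub.
Note: this could be done with less code than shown here, but the answer is written step by step to make it easier to follow.
Note 2: because the question is somewhat ambiguous, if I have misunderstood something and not given the right answer, don't hesitate to tell me and I will update the answer as best I can.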
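Since the question can also be read as asking for outliers within each speaker's own rows, here is a minimal sketch of that reading. It relies on `groupby(...).transform`, which returns a result aligned with the original rows, so no aggregation takes place. The data, the helper `zscore`, and the `threshold` variable are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: speaker "A" has 12 rows, one of them extreme.
df = pd.DataFrame({
    "Speaker": ["A"] * 12 + ["B"] * 3,
    "ARI": [10.0] * 11 + [500.0] + [15.0, 16.0, 17.0],
})
cols = ["ARI"]
threshold = 3  # number of standard deviations, as in the question


def zscore(values):
    """Standardize numeric values (a Series or DataFrame) within one group."""
    return (values - values.mean()) / values.std()


# Each row gets a z-score computed from its own speaker's mean and std.
z = df.groupby("Speaker")[cols].transform(zscore)

# Replace only the flagged entries with NaN. Rows whose z-score is NaN
# (for example, single-row groups) are left unchanged.
df[cols] = df[cols].mask(z.abs() > threshold)
print(df)
```

Using `mask` rather than dropping rows keeps the frame's shape, which matches the title's goal of replacing outliers with NaN instead of deleting them.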