Pandas: replace certain values in each column
I have a dataframe like the one shown below:
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| 0 | 6 | 148.0 | 72.0 | 35.0 | 125.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 125.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 125.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
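After looking at the box plot of each variable, I found that they contain outliers.
So, in every column except Outcome, I want to replace the values greater than that column's 95th percentile with the value at the 75th percentile, and the values less than the 5th percentile with the value at the 25th percentile.
For example, in the Glucose column, I want to replace the values above the 95th percentile with the 75th percentile value of the Glucose column.
How can I do this using pandas filtering and percentile functions?
Any help would be appreciated.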
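You can use apply on all columns except Outcome, together with the functions np.clip and np.percentile. Note that this clips every value in a column to that column's 25th to 75th percentile range: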
import numpy as np
percentile_df = df.set_index('Outcome').apply(lambda x: np.clip(x, *np.percentile(x, [25,75]))).reset_index()
>>> percentile_df
Outcome Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 1 6.0 148.0 66.0 35.0 125.0 33.6
1 0 1.0 89.0 66.0 29.0 125.0 26.6
2 1 6.0 148.0 64.0 29.0 125.0 26.6
3 0 1.0 89.0 66.0 29.0 125.0 28.1
4 1 1.0 137.0 64.0 35.0 125.0 33.6
DiabetesPedigreeFunction Age
0 0.627 33.0
1 0.351 31.0
2 0.672 32.0
3 0.351 31.0
4 0.672 33.0
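For comparison, the same clipping can be sketched with pandas methods only. This is a minimal sketch, not the answerer's code: it rebuilds three of the sample columns by hand, and the names feature_cols, bounds, and clipped are illustrative. It assumes DataFrame.quantile for the per-column bounds and DataFrame.clip with axis=1 to align those bounds with the columns.

```python
import pandas as pd

# A subset of the sample data from the question (three feature columns plus Outcome).
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 66.0, 40.0],
    "Outcome": [1, 0, 1, 0, 1],
})

# Every column except the label.
feature_cols = df.columns.drop("Outcome")

# One row per requested quantile, one column per feature.
bounds = df[feature_cols].quantile([0.25, 0.75])

# axis=1 aligns the lower/upper Series with the column labels.
clipped = df[feature_cols].clip(
    lower=bounds.loc[0.25], upper=bounds.loc[0.75], axis=1
)

# Put the untouched label column back.
clipped["Outcome"] = df["Outcome"]
print(clipped)
```

Unlike the set_index approach, this leaves Outcome as an ordinary column and does not modify df.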
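[Edit] I misread the question at first. Here is a way to use np.select to replace the values above the 95th percentile and below the 5th percentile with the 75th and 25th percentile values, respectively: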
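def cut(column):
    # Conditions are checked in order; values matching neither keep their original value.
    conds = [column > np.percentile(column, 95),
             column < np.percentile(column, 5)]
    choices = [np.percentile(column, 75),
               np.percentile(column, 25)]
    return np.select(conds, choices, column)

# Move Outcome into the index so that apply skips it, then restore it as a column.
df.set_index('Outcome', inplace=True)
df = df.apply(cut).reset_index()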
>>> df
Outcome Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 1 6.0 148.0 66.0 35.0 125.0 33.6
1 0 1.0 89.0 66.0 29.0 125.0 26.6
2 1 6.0 148.0 64.0 29.0 125.0 26.6
3 0 1.0 89.0 66.0 29.0 125.0 28.1
4 1 1.0 137.0 64.0 35.0 125.0 33.6
DiabetesPedigreeFunction Age
0 0.627 33.0
1 0.351 31.0
2 0.672 32.0
3 0.351 31.0
4 0.672 33.0
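Since the question asks about pandas filtering and percentile functions, the same replacement can also be sketched with Series.quantile and Series.mask. This is a hedged illustration rather than the answerer's method: the helper name replace_tails and the hand-built three-column sample are assumptions made for the example. With only five rows, just the minimum and maximum of each column fall outside the 5th to 95th percentile range.

```python
import pandas as pd

# A subset of the sample data from the question (three feature columns plus Outcome).
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 66.0, 40.0],
    "Outcome": [1, 0, 1, 0, 1],
})


def replace_tails(column: pd.Series) -> pd.Series:
    """Replace values above the 95th percentile with the 75th percentile value,
    and values below the 5th percentile with the 25th percentile value."""
    p05, p25, p75, p95 = column.quantile([0.05, 0.25, 0.75, 0.95])
    # Both masks are computed from the original column, so the two
    # replacements cannot interfere with each other.
    replaced = column.mask(column > p95, p75)
    return replaced.mask(column < p05, p25)


# Apply the helper to every column except the label, then re-attach the label.
feature_cols = df.columns.drop("Outcome")
result = df[feature_cols].apply(replace_tails).assign(Outcome=df["Outcome"])
print(result)
```

This version returns a new dataframe and leaves df unchanged, which makes it easier to compare the columns before and after the replacement.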