How can I categorize all columns in a dataset at once? (Make all values become High, Medium, Low)
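I am trying to convert all the values in my dataset into categorical values. I want every numeric value to be categorized as low, average, or high based on its quantile.
So if a value is below the 25% quantile of its series, it is converted to "Low".
I tried using assign and then applying a function I wrote: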
def turn_into_categorical(row):
    quantile_level = [.25, .5, .75]
    for r in row:
        cut = refugees_T_F_V_P_full_data.r.quantile(quantile_level)
        if r >= cut[.75]:
            return "High"
        elif r >= cut[.25] and r < cut[0.75]:
            return "Average"
        else:
            return "Low"

refugees_T_F_V_P_full_data.apply(turn_into_categorical, axis = 1)
However, the code does not work properly. I also tried iterrows, but I wonder whether there is a faster way?
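A likely reason the attempt above misbehaves: with axis=1 the function receives one row at a time, `refugees_T_F_V_P_full_data.r` looks up a column literally named `r` rather than the column of the current value, and `return` exits on the first element of the row. A minimal sketch of a column-wise version with the same thresholds follows; the helper name `categorize_column` and the tiny sample frame are illustrative assumptions, not part of the original post.

```python
import pandas as pd


def categorize_column(col: pd.Series) -> pd.Series:
    """Label each value of one column using that column's own quartiles."""
    q25 = col.quantile(0.25)
    q75 = col.quantile(0.75)
    # Same thresholds as the original function:
    # >= 75% quantile is "High", >= 25% quantile is "Average", otherwise "Low".
    return col.map(
        lambda v: "High" if v >= q75 else ("Average" if v >= q25 else "Low")
    )


# Small illustrative frame (first four rows of the Jonglei column).
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "Month": [10, 11, 12, 1],
    "Jonglei": [3.0, 12.0, 11.0, 4.0],
})

result = df.copy()
for name in df.columns.drop(["Year", "Month"]):
    # One call per column, so the quantiles are computed once per column.
    result[name] = categorize_column(df[name])

print(result)
```

Working per column also avoids recomputing the quantiles for every single cell, which is where most of the time goes in a row-by-row approach.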
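This is the data I want to convert. All numbers except Year and Month should be categorized as low, medium, or high based on their quantile values.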
Year Month Central Equatoria Eastern Equatoria Gogrial Jonglei
0 2014 10 6.0 1.0 0.0 3.0
1 2014 11 4.0 3.0 0.0 12.0
2 2014 12 3.0 5.0 0.0 11.0
3 2015 1 7.0 2.0 0.0 4.0
4 2015 2 5.0 5.0 0.0 10.0
5 2015 3 7.0 5.0 0.0 8.0
6 2015 4 4.0 1.0 0.0 6.0
7 2015 5 5.0 0.0 0.0 7.0
8 2015 6 4.0 1.0 0.0 6.0
9 2015 7 15.0 2.0 0.0 9.0
10 2015 8 10.0 7.0 0.0 9.0
11 2015 9 12.0 0.0 0.0 8.0
12 2015 10 12.0 0.0 0.0 5.0
13 2015 11 8.0 5.0 0.0 10.0
14 2015 12 5.0 7.0 0.0 3.0
Expected result (example):
Year Month Central Equatoria Eastern Equatoria Gogrial Jonglei
0 2014 10 High Medium Low Medium
1 2014 11 Low Medium Low high
One idea using pd.DataFrame.quantile and pd.cut:
cats = ['Low', 'Medium', 'High']
quantiles = df.iloc[:, 2:].quantile([0, 0.25, 0.75, 1.0])

for col in df.iloc[:, 2:]:
    bin_edges = quantiles[col]
    # special case situations where all values are equal
    if bin_edges.nunique() == 1:
        df[col] = 'Low'
    else:
        df[col] = pd.cut(df[col], bins=bin_edges, labels=cats, include_lowest=True)
Result:
print(df)
Year Month CentralEquatoria EasternEquatoria Gogrial Jonglei
0 2014 10 Medium Low Low Low
1 2014 11 Low Medium Low High
2 2014 12 Low Medium Low High
3 2015 1 Medium Medium Low Low
4 2015 2 Medium Medium Low High
5 2015 3 Medium Medium Low Medium
6 2015 4 Low Low Low Medium
7 2015 5 Medium Low Low Medium
8 2015 6 Low Low Low Medium
9 2015 7 High Medium Low Medium
10 2015 8 High High Low Medium
11 2015 9 High Low Low Medium
12 2015 10 High Low Low Low
13 2015 11 Medium Medium Low High
14 2015 12 Medium High Low Low
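To see where these labels come from, it can help to look at the bin edges for a single column. The sketch below recomputes them for the Central Equatoria values from the question; the variable names are illustrative.

```python
import pandas as pd

# Central Equatoria values from the question's sample data.
central = pd.Series(
    [6, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5], dtype=float
)

# Minimum, 25% quantile, 75% quantile and maximum become the bin edges.
edges = central.quantile([0, 0.25, 0.75, 1.0])
print(edges.tolist())

# include_lowest=True keeps the minimum value inside the first bin.
labels = pd.cut(
    central,
    bins=edges.to_numpy(),
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)
counts = labels.value_counts().to_dict()
print(counts)
```

With 15 rows, roughly a quarter of the values land in Low, half in Medium and a quarter in High, which matches the Central Equatoria column in the result above.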
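It looks like you want pd.qcut, which does exactly this. From the docs:
Quantile-based discretization function
So you can apply pd.qcut along the columns of the dataframe, starting from Central Equatoria, and specify the quantiles used to bin each series with q = [0, 0.25, 0.75, 1.0]: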
df.loc[:, 'Central Equatoria':].apply(
    lambda x: pd.qcut(x, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    if not x.nunique() == 1 else 'low')
Output:
Central Equatoria Eastern Equatoria Gogrial Jonglei
0 medium low low low
1 low medium low high
2 low medium low high
3 medium medium low low
4 medium medium low high
5 medium medium low medium
6 low low low medium
7 medium low low medium
8 low low low medium
9 high medium low medium
10 high high low medium
11 high low low medium
12 high low low low
13 medium medium low high
14 medium high low low
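The `x.nunique() == 1` guard matters because a constant column such as Gogrial has identical quantiles, and pd.qcut rejects duplicate bin edges by default. Below is a self-contained sketch of the same idea written as a named helper that always returns a Series; `qcut_or_low` is a hypothetical name, and casting the result to plain strings is my own choice, not something the answer prescribes.

```python
import pandas as pd

Q = [0, 0.25, 0.75, 1.0]
LABELS = ["low", "medium", "high"]


def qcut_or_low(col: pd.Series) -> pd.Series:
    """Quantile-bin one column; a constant column is labelled 'low' throughout."""
    if col.nunique() == 1:
        # All quantiles coincide here, so there is nothing to bin.
        return pd.Series("low", index=col.index)
    # Cast to str so every column ends up with the same plain-string dtype.
    return pd.qcut(col, q=Q, labels=LABELS).astype(str)


df = pd.DataFrame({
    "Year": [2014] * 3 + [2015] * 12,
    "Month": [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "Eastern Equatoria": [1.0, 3.0, 5.0, 2.0, 5.0, 5.0, 1.0, 0.0,
                          1.0, 2.0, 7.0, 0.0, 0.0, 5.0, 7.0],
    "Gogrial": [0.0] * 15,
})

out = df.loc[:, "Eastern Equatoria":].apply(qcut_or_low)
print(out.head())
```

Returning a Series in both branches keeps the result of apply a regular DataFrame regardless of which columns are constant.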
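Using pd.cut() and df.apply() (note that with axis=1 this bins each row into three equal-width intervals, not by column quantiles):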
df.iloc[:,2:]=df.iloc[:,2:].apply(lambda x:pd.cut(x, 3, labels=['Low','Med','High']), axis=1)
Year Month Central_Equatoria Eastern_Equatoria Gogrial Jonglei
0 2014 10 High Low Low Med
1 2014 11 Low Low Low High
2 2014 12 Low Med Low High
3 2015 1 High Low Low Med
4 2015 2 Med Med Low High
5 2015 3 High Med Low High
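For comparison, here is a small sketch of how equal-width bins from pd.cut(x, 3) differ from quantile bins on one column of the question's data. It works on a single column rather than row-wise, purely to make the difference visible; the variable names are illustrative.

```python
import pandas as pd

# Central Equatoria values from the question's sample data.
central = pd.Series(
    [6, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5], dtype=float
)
labels = ["Low", "Med", "High"]

# Three intervals of equal width between the minimum (3) and maximum (15).
equal_width = pd.cut(central, 3, labels=labels)

# Intervals whose edges are the 25% and 75% quantiles of the data.
by_quantile = pd.qcut(central, q=[0, 0.25, 0.75, 1.0], labels=labels)

width_counts = equal_width.value_counts().to_dict()
quantile_counts = by_quantile.value_counts().to_dict()
print(width_counts)
print(quantile_counts)
```

Because the column is skewed by a few large values, the equal-width version puts most rows into Low, while the quantile version spreads them roughly 25/50/25.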
I ended up doing it the old-fashioned way:
new_df = pd.DataFrame()
name_list = list(df)

for name in name_list:
    if name != 'Year' and name != 'Month':
        new_row = []
        quantiles = df[name].quantile([.25, .5, .75])
        row_list = df[name].tolist()
        for i, value in enumerate(row_list):
            if value < quantiles[.25]:
                new_row.append("Low")
            elif value < quantiles[.75] and value >= quantiles[.25]:
                new_row.append("Average")
            else:
                new_row.append("High")
        series = pd.Series(new_row)
        new_df[name] = series.values
new_df.head()
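The same thresholds can also be applied without the inner Python loop. This is a sketch using numpy.select, assuming the comparisons from the loop above (below the 25% quantile is Low, below the 75% quantile is Average, everything else is High); `label_by_quartiles` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def label_by_quartiles(col: pd.Series) -> pd.Series:
    """Vectorized version of the loop: Low / Average / High per column."""
    q25 = col.quantile(0.25)
    q75 = col.quantile(0.75)
    # Conditions are checked in order; rows matching neither get the default.
    labels = np.select([col < q25, col < q75], ["Low", "Average"], default="High")
    return pd.Series(labels, index=col.index)


df = pd.DataFrame({
    "Year": [2014] * 3 + [2015] * 12,
    "Month": [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "Gogrial": [0.0] * 15,
    "Jonglei": [3.0, 12.0, 11.0, 4.0, 10.0, 8.0, 6.0, 7.0,
                6.0, 9.0, 9.0, 8.0, 5.0, 10.0, 3.0],
})

new_df = df.drop(columns=["Year", "Month"]).apply(label_by_quartiles)
print(new_df.head())
```

One thing to be aware of: with these strict comparisons a constant column such as Gogrial comes out as High, since no value is below its own quantiles, whereas the other answers label it Low.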