How can I categorize all columns in a dataset at once? (Make all values High, Medium, or Low)

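I am trying to convert all the values in my dataset to categorical values. I want every numeric value to be classified as Low, Average, or High based on the quantiles of its column.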

So if a value falls below the 25% quantile of its series, it should be converted to "Low".

I tried using assign and then applying a function I wrote:

def turn_into_categorical(row):
    quantile_level = [.25, .5, .75]
    for r in row:
        cut = refugees_T_F_V_P_full_data.r.quantile(quantile_level)
        if r >= cut[.75]:
            return "High"
        elif r >= cut[.25] and r < cut[0.75]:
            return "Average"
        else:
            return "Low"

refugees_T_F_V_P_full_data.apply(turn_into_categorical, axis = 1)

However, the code does not work correctly. I also tried iterrows, but I wonder whether there is a faster way?

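This is the data I want to convert. Every number except Year and Month should be classified as Low, Medium, or High based on the quantiles of its column.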

    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10                6.0                1.0      0.0      3.0   
1   2014     11                4.0                3.0      0.0     12.0   
2   2014     12                3.0                5.0      0.0     11.0   
3   2015      1                7.0                2.0      0.0      4.0   
4   2015      2                5.0                5.0      0.0     10.0   
5   2015      3                7.0                5.0      0.0      8.0   
6   2015      4                4.0                1.0      0.0      6.0   
7   2015      5                5.0                0.0      0.0      7.0   
8   2015      6                4.0                1.0      0.0      6.0   
9   2015      7               15.0                2.0      0.0      9.0   
10  2015      8               10.0                7.0      0.0      9.0   
11  2015      9               12.0                0.0      0.0      8.0   
12  2015     10               12.0                0.0      0.0      5.0   
13  2015     11                8.0                5.0      0.0     10.0   
14  2015     12                5.0                7.0      0.0      3.0 

Expected result (example):

    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10                High             Medium      Low      Medium  
1   2014     11                Low              Medium      Low     High

One idea using pd.DataFrame.quantile with pd.cut:

cats = ['Low', 'Medium', 'High']
quantiles = df.iloc[:, 2:].quantile([0, 0.25, 0.75, 1.0])

for col in df.iloc[:, 2:]:
    bin_edges = quantiles[col]
    # special case situations where all values are equal
    if bin_edges.nunique() == 1:
        df[col] = 'Low'
    else:
        df[col] = pd.cut(df[col], bins=bin_edges, labels=cats, include_lowest=True)

Result:

print(df)

    Year  Month CentralEquatoria EasternEquatoria Gogrial Jonglei
0   2014     10           Medium              Low     Low     Low
1   2014     11              Low           Medium     Low    High
2   2014     12              Low           Medium     Low    High
3   2015      1           Medium           Medium     Low     Low
4   2015      2           Medium           Medium     Low    High
5   2015      3           Medium           Medium     Low  Medium
6   2015      4              Low              Low     Low  Medium
7   2015      5           Medium              Low     Low  Medium
8   2015      6              Low              Low     Low  Medium
9   2015      7             High           Medium     Low  Medium
10  2015      8             High             High     Low  Medium
11  2015      9             High              Low     Low  Medium
12  2015     10             High              Low     Low     Low
13  2015     11           Medium           Medium     Low    High
14  2015     12           Medium             High     Low     Low
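
For anyone who wants to try this without modifying df in place, here is a minimal self-contained sketch of the same idea. The helper name categorize and the choice to copy the frame are my own assumptions, and only the first five rows of two columns are used. It still raises a ValueError if only some of the quantile edges coincide, so treat it as a starting point rather than a complete solution.

```python
import pandas as pd

# First five rows of two columns from the question, for illustration only.
df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015],
    'Month': [10, 11, 12, 1, 2],
    'Central Equatoria': [6.0, 4.0, 3.0, 7.0, 5.0],
    'Gogrial': [0.0, 0.0, 0.0, 0.0, 0.0],
})

def categorize(col, labels=('Low', 'Medium', 'High')):
    """Bin one numeric column at its 25% and 75% quantiles (hypothetical helper)."""
    edges = col.quantile([0, 0.25, 0.75, 1.0])
    # A constant column yields identical edges, which pd.cut rejects.
    if edges.nunique() == 1:
        return pd.Series(labels[0], index=col.index)
    return pd.cut(col, bins=edges.to_numpy(), labels=list(labels),
                  include_lowest=True)

# Every column except the date columns gets binned.
value_cols = [c for c in df.columns if c not in ('Year', 'Month')]

result = df.copy()
for name in value_cols:
    result[name] = categorize(df[name])

print(result)
```

Here the edges for Central Equatoria are 3, 4, 6 and 7, so 3 and 4 become Low, 5 and 6 become Medium, and 7 becomes High.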

It looks like you want pd.qcut, which does exactly that. From the documentation:

Quantile-based discretization function

So you can apply pd.qcut along the columns of the dataframe, starting from Central Equatoria, and specify the quantiles used to bin each series with q = [0, 0.25, 0.75, 1.0]:

# A constant column has no distinct quantile edges, so label it 'low' directly.
df.loc[:, 'Central Equatoria':].apply(
    lambda x: pd.qcut(x, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    if x.nunique() != 1 else 'low')

Output:

       Central Equatoria Eastern Equatoria Gogrial Jonglei
0            medium              low     low     low
1               low           medium     low    high
2               low           medium     low    high
3            medium           medium     low     low
4            medium           medium     low    high
5            medium           medium     low  medium
6               low              low     low  medium
7            medium              low     low  medium
8               low              low     low  medium
9              high           medium     low  medium
10             high             high     low  medium
11             high              low     low  medium
12             high              low     low     low
13           medium           medium     low    high
14           medium             high     low     low
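
To check where the boundaries fall for a given column, pd.qcut can also return the bin edges through retbins=True. A small sketch on the first five Jonglei values from the question (the variable names are mine, not part of the answer above):

```python
import pandas as pd

# First five Jonglei values from the question's data.
jonglei = pd.Series([3.0, 12.0, 11.0, 4.0, 10.0], name='Jonglei')

# retbins=True additionally returns the computed quantile edges.
binned, edges = pd.qcut(
    jonglei,
    q=[0, 0.25, 0.75, 1.0],
    labels=['low', 'medium', 'high'],
    retbins=True,
)

print(edges)            # the edges are 3, 4, 11 and 12
print(binned.tolist())  # ['low', 'high', 'medium', 'low', 'medium']
```

The intervals are closed on the right, so a value equal to the 25% quantile (4 here) lands in the lower bin, and a value equal to the 75% quantile (11 here) lands in the middle bin.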

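Using pd.cut() with df.apply(). Note that passing 3 as the bins argument creates three equal-width bins rather than quantile-based ones, and axis=1 applies the binning across each row: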

df.iloc[:,2:]=df.iloc[:,2:].apply(lambda x:pd.cut(x, 3, labels=['Low','Med','High']), axis=1)

   Year  Month Central_Equatoria Eastern_Equatoria Gogrial Jonglei
0  2014     10              High               Low     Low     Med
1  2014     11               Low               Low     Low    High
2  2014     12               Low               Med     Low    High
3  2015      1              High               Low     Low     Med
4  2015      2               Med               Med     Low    High
5  2015      3              High               Med     Low    High
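
In case the difference between pd.cut with an integer and pd.qcut matters for your data, here is a small comparison on a made-up skewed series (the values are hypothetical and chosen only to make the contrast visible):

```python
import pandas as pd

# Hypothetical skewed data: one large outlier.
s = pd.Series([1, 2, 3, 4, 100])
labels = ['Low', 'Med', 'High']

# Three bins of equal width across the range 1..100.
equal_width = pd.cut(s, 3, labels=labels)

# Bins split at the 25% and 75% quantiles of the data.
by_quantile = pd.qcut(s, q=[0, 0.25, 0.75, 1.0], labels=labels)

print(equal_width.tolist())  # ['Low', 'Low', 'Low', 'Low', 'High']
print(by_quantile.tolist())  # ['Low', 'Low', 'Med', 'Med', 'High']
```

With equal-width bins the outlier stretches the range, so almost everything ends up in the lowest bin. The quantile-based version spreads the labels according to rank, which is closer to what the question asks for.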

I ended up doing it the old-fashioned way:

new_df = pd.DataFrame()
name_list = list(df)

for name in name_list:
    if name != 'Year' and name != 'Month':
        new_row = []
        quantiles = df[name].quantile([.25, .5, .75])
        row_list = df[name].tolist()
        for i, value in enumerate(row_list):
            if value < quantiles[.25]:
                new_row.append("Low")
            elif value < quantiles[.75] and value >= quantiles[.25]:
                new_row.append("Average")
            else:
                new_row.append("High")
        series = pd.Series(new_row)
        new_df[name] = series.values

new_df.head()
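
If the explicit loops turn out to be slow, the same thresholds can probably be applied to all columns at once with numpy.select. This is only a sketch under my own assumptions: it uses the first five Jonglei values from the question, and the names values, q25 and q75 are mine. It keeps the loop's rule that a value equal to the 25% quantile counts as "Average" and a value equal to the 75% quantile counts as "High", which differs from the right-closed bins of pd.cut and pd.qcut.

```python
import numpy as np
import pandas as pd

# First five rows of one column from the question, for illustration only.
df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015],
    'Month': [10, 11, 12, 1, 2],
    'Jonglei': [3.0, 12.0, 11.0, 4.0, 10.0],
})

values = df.drop(columns=['Year', 'Month'])
q25 = values.quantile(0.25)  # one threshold per column
q75 = values.quantile(0.75)

# Each comparison aligns the thresholds with the matching columns.
# np.select takes the first condition that is True for each cell.
labels = np.select(
    [(values < q25).to_numpy(), (values < q75).to_numpy()],
    ['Low', 'Average'],
    default='High',
)

new_df = pd.DataFrame(labels, index=values.index, columns=values.columns)
print(new_df)
```

For these five values the quantiles are 4 and 11, so 3 is Low, 4 and 10 are Average, and 11 and 12 are High.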