Discretization: converting continuous values into a fixed number of categories

1   Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.

2   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

3   Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.

4   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

My code is as follows:

df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))

The output is as follows:

               Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1
               Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1

But -1925 is wrong...

The correct answer should look like this.

What should I do?

There may be a typo on line 1: df["Usage_Per_Year "]? The column name has a trailing space.
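One way to spot and repair this kind of typo is sketched below. The three-row frame is hypothetical and only reproduces the column name from the question; the idea is that printing the column names as a list makes stray whitespace visible, and `str.strip` removes it.

```python
import pandas as pd

# Hypothetical frame that reproduces the trailing-space column name.
df = pd.DataFrame({"Miles_Driven_Per_Year": [100, 200, 300]})
df["Usage_Per_Year "] = ["Low", "Medium", "High"]

# Printing the names as a list shows the quotes, so the extra space stands out.
print(df.columns.tolist())

# Strip leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()
print(df.columns.tolist())
```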

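pd.cut splits the values into bins of equal width, which is why all of your bins have the same size. The negative lower edge is expected: with the default right=True, pd.cut lowers the first edge by 0.1% of the data range so that the minimum value falls inside the first bin. To report the actual range of each group, compute the min and max of each group after binning.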

Also, to split the values into bins of equal frequency, use pd.qcut.
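A minimal sketch of how the two functions place their bin edges, using hypothetical toy data rather than the asker's dataset. `retbins=True` returns the edges alongside the binned values, so the slightly lowered first edge from pd.cut and the quantile-based edges from pd.qcut can be compared directly.

```python
import pandas as pd

# Hypothetical toy data: five small values and one large outlier.
miles = pd.Series([50, 200, 300, 400, 500, 100_000])

# Equal-width bins: edges are evenly spaced between min and max, then the
# first edge is lowered by 0.1% of the range so that the minimum is included.
width_bins, width_edges = pd.cut(miles, bins=3, retbins=True)

# Equal-frequency bins: edges are quantiles of the data.
freq_bins, freq_edges = pd.qcut(miles, q=3, retbins=True)

# The first equal-width edge is below zero even though every value is positive.
print(width_edges)
print(freq_edges)

# Number of points per bin, in bin order.
width_counts = width_bins.value_counts(sort=False).tolist()
freq_counts = freq_bins.value_counts(sort=False).tolist()
print(width_counts)
print(freq_counts)
```

With an outlier like this, the equal-width bins leave the middle bin empty, while the equal-frequency bins each hold the same number of points.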


Sample input:

import numpy as np
import pandas as pd

rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})

# 1: three equal-width bins
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)

# 2: group sizes and actual value range per group
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

# 3: three equal-frequency bins (same number of points per bin)
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)

# 4: group sizes and actual value range per group
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

Sample output:

               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              878     31  20905
Medium                           107  20955  41196
High                              15  41991  62668
               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              334     31   4378
Medium                           333   4449  11424
High                             333  11442  62668