Discretization: converting continuous values into a fixed number of categories
1 Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.
2 Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
3 Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.
4 Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
My code is as follows:
df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
The output is as follows:
                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1
                   Usage_Per_Year     0
Low       (-1925.883, 663476.235]  6018
Medium  (663476.235, 1326888.118]     0
High     (1326888.118, 1990300.0]     1
But -1925 is wrong...
The correct answer should look like this.
What should I do...
Line 1 may contain a typo: df["Usage_Per_Year "]? The column name has a trailing space.
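A quick way to check for this kind of typo is to print the column names with repr and, if needed, strip the surrounding whitespace. This is a minimal sketch on a small made-up DataFrame (the values are illustrative, not the asker's data):

```python
import pandas as pd

# Hypothetical data; note the trailing space in the second column name.
df = pd.DataFrame({
    "Miles_Driven_Per_Year": [1200, 8500, 15000],
    "Usage_Per_Year ": ["Low", "Medium", "High"],
})

# repr() makes stray whitespace visible.
print([repr(name) for name in df.columns])

# Strip leading/trailing whitespace from every column name.
df.columns = df.columns.str.strip()
cleaned_columns = list(df.columns)
print(cleaned_columns)
```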
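pd.cut splits the values into bins of equal width. That is why all of your bins have the same width. It looks like you should compute the minimum and maximum of each group after binning.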
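The negative lower edge in the question is most likely a side effect of how pd.cut builds equal-width bins from an integer bin count: according to the pandas documentation, the range of the data is extended by 0.1% so that the minimum falls inside the first left-open interval. A small sketch with made-up values (not the asker's data) that inspects the edges through retbins:

```python
import pandas as pd

values = pd.Series([0, 250, 500, 750, 1000])  # illustrative data

# retbins=True also returns the computed bin edges.
binned, edges = pd.cut(values, bins=3, retbins=True)

data_range = values.max() - values.min()
print(edges)                    # the first edge sits slightly below the minimum
print(values.min() - edges[0])  # how far the first edge was pushed down
print(data_range * 0.001)       # 0.1% of the data range, for comparison
```

So the edge is a bin boundary, not a data value; the actual minimum and maximum of each group have to be computed from the data after binning.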
Also, to split the values into bins of equal frequency, you should use pd.qcut.
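To see the difference between the two functions, the sketch below bins the same skewed series with both. The data is made up: eight small values and one large outlier, which mimics the shape of the output in the question.

```python
import pandas as pd

labels = ["Low", "Medium", "High"]
# Illustrative data: eight small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: the outlier stretches the range, so most points land in "Low".
width_counts = (
    pd.cut(values, bins=3, labels=labels)
    .value_counts()
    .reindex(labels, fill_value=0)
)

# Equal-frequency bins: the edges are quantiles, so each bin gets the same count.
freq_counts = (
    pd.qcut(values, q=3, labels=labels)
    .value_counts()
    .reindex(labels, fill_value=0)
)

print(width_counts.tolist())  # [8, 0, 1]
print(freq_counts.tolist())   # [3, 3, 3]
```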
Sample input:
import numpy as np
import pandas as pd

# Reproducible, skewed sample data
rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})

# 1: three equal-width bins
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)
# 2: group size and actual value range of each bin
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

# 3: three equal-frequency bins
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)
# 4: group size and actual value range of each bin
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
Sample output:
               Miles_Driven_Per_Year
                               count    min    max
Usage_Per_Year
Low                              878     31  20905
Medium                           107  20955  41196
High                              15  41991  62668
               Miles_Driven_Per_Year
                               count    min    max
Usage_Per_Year
Low                              334     31   4378
Medium                           333   4449  11424
High                             333  11442  62668