Calculating the number of products that make up a cumulative proportion of sales
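I have a data frame with sales at the ppg and product level, and I want to find out how many products contribute to a certain percentage of sales (e.g. 75%), for example to test the Pareto principle.
The data is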
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
Ppg product sales
1 p1 A 50
2 p1 B 40
3 p1 C 30
4 p1 D 80
5 p2 A 100
6 p2 B 70
7 p2 C 30
I retrieved the cumulative sum using dplyr
df %>% group_by(Ppg) %>% mutate(c1 = cumsum(sales))
Ppg product sales c1
<fctr> <fctr> <dbl> <dbl>
1 p1 A 50 50
2 p1 B 40 90
3 p1 C 30 120
4 p1 D 80 200
5 p2 A 100 100
6 p2 B 70 170
7 p2 C 30 200
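Is there a way to
i) calculate the proportion of sales (based on the cumsum)
ii) find how many distinct products contribute to a certain percentage of sales.
Example for ppg p1: 2 distinct products (A and B combine for 75% of sales)
So in the end, something like the following would be ideal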
ppg Number_Products_towards_75%
p1 2
p2 1
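Assuming you can use the products in their current order to get the answer (since reordering the rows gives a different result):
For (i), you can get the result with an additional mutate. Just divide the cumulative sum by the sum of all sales in that group: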
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales))
which gives you:
# A tibble: 7 x 5
# Groups: Ppg [2]
Ppg product sales c1 percent
<fctr> <fctr> <dbl> <dbl> <dbl>
1 p1 A 50.0 50.0 0.250
2 p1 B 40.0 90.0 0.450
3 p1 C 30.0 120 0.600
4 p1 D 80.0 200 1.00
5 p2 A 100 100 0.500
6 p2 B 70.0 170 0.850
7 p2 C 30.0 200 1.00
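For (ii), you can use mutate to add a column flagging whether each product's cumulative share is below the threshold, then summarize to count the products below it (and add one to the count, since one more product is what puts you over the threshold).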
threshold <- 0.5
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales)) %>%
mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
which gives you:
# A tibble: 2 x 2
Ppg count
<fctr> <dbl>
1 p1 3.00
2 p2 1.00
But again, this depends on the order of the products. Consider sorting them from highest to lowest sales first? Something like:
df %>%
group_by(Ppg) %>%
arrange(Ppg, desc(sales))
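Putting the pieces together, here is a minimal sketch. It assumes that products should be ranked by sales within each Ppg and that the 75% threshold from the question is the one of interest. The data frame is rebuilt with plain character columns only so that the block runs on its own.

```r
library(dplyr)

# Same sample data as in the question, rebuilt with character columns
df <- data.frame(
  Ppg     = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  product = c("A", "B", "C", "D", "A", "B", "C"),
  sales   = c(50, 40, 30, 80, 100, 70, 30)
)

threshold <- 0.75  # assumed target share of sales

result <- df %>%
  arrange(Ppg, desc(sales)) %>%                      # largest sellers first within each Ppg
  group_by(Ppg) %>%
  mutate(percent = cumsum(sales) / sum(sales)) %>%   # cumulative share per group
  summarize(count = sum(percent < threshold) + 1)    # products below threshold, plus the one that crosses it

result
```

On this sample data the result is 3 for p1 and 2 for p2. That differs from the 2 and 1 in the question's desired output, because the two largest p1 products (D and A) reach only 65% of sales, so a third product is needed to pass 75%.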