In R, how can I group by one column and conditionally sum another?
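This is a follow-up to my previous question:
Suppose I have the data frame below. In my previous question I asked how to count, for each row, the number of times that row's customer subsequently ordered product X (literally X, not the product associated with that row); that count is now given in nSubsqX. Now I want the total cost of those subsequent X orders. I have entered the answers by hand in nCostSubsqX below, but I don't know how to compute them programmatically.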
         Date Customer Product cost nSubsqX nCostSubsqX
1  2020-05-18        A       X    9       0           0
2  2020-02-10        B       X    2       5          42
3  2020-02-12        B       Y    3       5          42
4  2020-03-04        B       Z    4       5          42
5  2020-03-29        B       X    5       4          37
6  2020-04-08        B       X    6       3          31
7  2020-04-30        B       X    7       2          24
8  2020-05-13        B       X    8       1          16
9  2020-05-23        B       Y   10       1          16
10 2020-07-02        B       Y   11       1          16
11 2020-08-26        B       Y   12       1          16
12 2020-12-06        B       X   16       0           0
13 2020-01-31        C       X    1       3          42
14 2020-09-19        C       X   13       2          29
15 2020-10-13        C       X   14       1          15
16 2020-11-11        C       X   15       0           0
17 2020-12-26        C       Y   17       0           0
For a reprex, here is the code that creates the data frame.
df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12",
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18",
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11",
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))
df$cost = seq(nrow(df))
Here is the code that produces nSubsqX:
library(dplyr)

df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"))
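Now I need to work out how to build that vector from the rows where Product is X, but taking the values from the cost column rather than from the Product column itself. Any ideas?
Attempt 1, which throws an error.
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[which(Product == "X")]) - cumsum(cost[which(Product == "X")]))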
...
Error in `mutate_cols()`:
Problem with `mutate()` column `nCostSubsqX`.
`nCostSubsqX = sum(cost[which(Product == "X")]) - ...`.
`nCostSubsqX` must be size 11 or 1, not 6.
The error occurred in group 2: Customer = "B".
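The size mismatch can be reproduced outside dplyr. Below is a minimal base R sketch using customer B's rows from the data above; the vector names `product_b`, `cost_b` and `x_only` are made up for illustration.

```r
# Customer B's rows in date order, copied from the data frame above.
product_b <- c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Y", "X")
cost_b    <- c(2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 16)

# Subsetting keeps only the X rows, so the result is shorter than the group.
x_only <- cumsum(cost_b[product_b == "X"])

length(cost_b)  # 11 rows in the group
length(x_only)  # 6 values, which cannot fill an 11-row column
```

These are the 11 and 6 reported in the error message for Customer = "B".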
Attempt 2: the math is wrong. The nCostSubsqX column needs to have the cumulative cost up to that point subtracted.
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()),
      Product == "X", rev(seq_len(sum(cost[which(Product == "X")]))))))
...
         Date Customer Product cost nSubsqX nCostSubsqX
1  2020-05-18        A       X    9       0           9
2  2020-02-10        B       X    2       5          44
3  2020-02-12        B       Y    3       5          44
4  2020-03-04        B       Z    4       5          44
5  2020-03-29        B       X    5       4          43
6  2020-04-08        B       X    6       3          42
7  2020-04-30        B       X    7       2          41
8  2020-05-13        B       X    8       1          40
9  2020-05-23        B       Y   10       1          40
10 2020-07-02        B       Y   11       1          40
11 2020-08-26        B       Y   12       1          40
12 2020-12-06        B       X   16       0          39
13 2020-01-31        C       X    1       3          43
14 2020-09-19        C       X   13       2          42
15 2020-10-13        C       X   14       1          41
16 2020-11-11        C       X   15       0          40
17 2020-12-26        C       Y   17       0          40
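As far as I can tell, the 44, 43, 42, ... pattern for customer B comes from rev(seq_len(...)) being a plain countdown from the group's total X cost, rather than that total minus a running cost. A small base R sketch of just that piece (the names `total_b` and `countdown` are illustrative only):

```r
# Total cost of customer B's X orders: 2 + 5 + 6 + 7 + 8 + 16.
total_b <- sum(c(2, 5, 6, 7, 8, 16))

# rev(seq_len(total)) counts down from the total in steps of 1, regardless of
# the individual costs; only its first six values land on B's six X rows.
countdown <- rev(seq_len(total_b))[1:6]
countdown  # 44 43 42 41 40 39
```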
Attempt 3: I don't know what the math is doing here, but it isn't right!
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = zoo::na.locf0(replace(rep(NA_real_, n()),
      Product == "X", rev(seq_len(sum(cost[which(Product == "X")])))))-
      zoo::na.locf0(ifelse(Product == "X",cumsum(cost[which(Product == "X")]),NA)))
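Attempt 1 was almost there. What matters is keeping the number of rows: inside cumsum(), replace cost[which(Product == "X")] with cost*(Product=="X") (a dirty trick), which zeroes the non-X rows instead of dropping them.
By the way, which() is unnecessary here; a logical vector can index cost directly.
The code would be: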
df %>%
  arrange(Customer, Date) %>%
  group_by(Customer) %>%
  mutate(
    nSubsqX = sum(Product=="X") - cumsum(Product=="X"),
    nCostSubsqX = sum(cost[Product == "X"]) - cumsum(cost*(Product == "X")))
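To see why the multiplication keeps the length intact, here is a minimal base R sketch of the same arithmetic on customer B's rows alone. The names `product_b`, `cost_b`, `masked` and `remaining` are illustrative and not part of the code above.

```r
# Customer B's rows in date order, copied from the question's data.
product_b <- c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Y", "X")
cost_b    <- c(2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 16)

# TRUE/FALSE is coerced to 1/0, so non-X rows become 0 instead of vanishing.
masked <- cost_b * (product_b == "X")
masked     # 2 0 0 5 6 7 8 0 0 0 16

# Total X cost minus the running X cost leaves the cost of later X orders.
remaining <- sum(masked) - cumsum(masked)
remaining  # 42 42 42 37 31 24 16 16 16 16 0
```

Both vectors have 11 elements, one per row of the group, which is what mutate() requires.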
If you're interested, here is a slightly different approach.
library(data.table)
f <- function(p,co=rep(1,length(p))) {
sapply(seq_along(p), \(i) sum(co[-i:0][p[-i:0]=="X"]))
}
setDT(df)[
order(Date,Customer),
`:=`(nSubsqX = f(Product),nCostSubsqx=f(Product, cost)),
by=Customer
]
In this approach I use the same function f() for both nSubsqX and nCostSubsqx; the only difference is whether cost is passed to f() as the extra co argument or the default co is used.
Output:
          Date Customer Product  cost nSubsqX nCostSubsqx
        <Date>   <char>  <char> <int>   <num>       <int>
 1: 2020-01-31        C       X     1       3          42
 2: 2020-02-10        B       X     2       5          42
 3: 2020-02-12        B       Y     3       5          42
 4: 2020-03-04        B       Z     4       5          42
 5: 2020-03-29        B       X     5       4          37
 6: 2020-04-08        B       X     6       3          31
 7: 2020-04-30        B       X     7       2          24
 8: 2020-05-13        B       X     8       1          16
 9: 2020-05-18        A       X     9       0           0
10: 2020-05-23        B       Y    10       1          16
11: 2020-07-02        B       Y    11       1          16
12: 2020-08-26        B       Y    12       1          16
13: 2020-09-19        C       X    13       2          29
14: 2020-10-13        C       X    14       1          15
15: 2020-11-11        C       X    15       0           0
16: 2020-12-06        B       X    16       0           0
17: 2020-12-26        C       Y    17       0           0
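The index -i:0 inside f() may look odd: it parses as (-i):0, so it drops the first i elements and leaves only the later rows. A small base R sketch of f() on customer C's rows, without data.table, follows. It is written with function(i) instead of \(i) so it should also run on R versions before 4.1; `product_c` and `cost_c` are illustrative names.

```r
# Same logic as f() above, using function(i) in place of the \(i) shorthand.
f <- function(p, co = rep(1, length(p))) {
  sapply(seq_along(p), function(i) sum(co[-i:0][p[-i:0] == "X"]))
}

# For i = 2, -i:0 is c(-2, -1, 0): elements 1 and 2 are dropped, 0 is ignored.
letters[1:5][-2:0]    # "c" "d" "e"

# Customer C's rows in date order, copied from the question's data.
product_c <- c("X", "X", "X", "X", "Y")
cost_c    <- c(1, 13, 14, 15, 17)

f(product_c)          # 3 2 1 0 0   (count of later X orders)
f(product_c, cost_c)  # 42 29 15 0 0 (cost of later X orders)
```

Note that f() rescans the rest of the group for every row, so its work grows quadratically with group size; the sum-minus-cumsum form does the same job in one pass.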