How can I count a number of conditional rows within R dplyr mutate?
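I want to take a set of data, group it by one column, sort it by another column, and then count the subsequent occurrences of a certain event. For example, in the data below, I want to add a column, call it nSubsqX, that tells me, at each row, how many subsequent orders for that customer have Product "X". Row 1 should be 3, because rows 13:15 are all Customer C, Product X; row 9 should be 0, because Customer A has no subsequent orders.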
Date Customer Product
1 2020-01-31 C X
2 2020-02-10 B X
3 2020-02-12 B Y
4 2020-03-04 B Z
5 2020-03-29 B X
6 2020-04-08 B X
7 2020-04-30 B X
8 2020-05-13 B X
9 2020-05-18 A X
10 2020-05-23 B Y
11 2020-07-02 B Y
12 2020-08-26 B Y
13 2020-09-19 C X
14 2020-10-13 C X
15 2020-11-11 C X
16 2020-12-06 B X
17 2020-12-26 C Y
To provide a reprex, here is the code to create the data frame.
df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12",
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18",
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11",
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))
I expect I will need some sort of mutate function, but I can't quite get it right. I tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X")]))
That gives the total number of times "X" occurs, but I want the number of subsequent occurrences. I also tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X" & Date > Date)]))
which returns 0, probably because Date > Date doesn't make any sense. I need a way to say Date > THIS date. The solution I'm trying to achieve looks like this:
Date Customer Product nSubsqX
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
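I think this is just a matter of not even knowing what terms to search for, so I'm sure that if I could figure out the right search terms, there is something out there that would show me how to do this. I appreciate anyone pointing me in the right direction.
Thanks!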
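One direct way to express "Date > THIS date" is to visit each row in turn and count the same customer's "X" orders that have a later date. The following is only a minimal base R sketch of that idea; the helper name count_later_x is made up for illustration, and the approach compares every row against every other row, so it is best suited to small data or to checking faster solutions.

```r
# Sample data from the question.
df <- data.frame(
  Date = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", "2020-03-04",
                   "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13",
                   "2020-05-18", "2020-05-23", "2020-07-02", "2020-08-26",
                   "2020-09-19", "2020-10-13", "2020-11-11", "2020-12-06",
                   "2020-12-26")),
  Customer = c("C","B","B","B","B","B","B","B","A",
               "B","B","B","C","C","C","B","C"),
  Product  = c("X","X","Y","Z","X","X","X","X","X",
               "Y","Y","Y","X","X","X","X","Y")
)

# Hypothetical helper: for row i of d, count rows of the same customer
# that have Product "X" and a strictly later Date.
count_later_x <- function(i, d) {
  sum(d$Customer == d$Customer[i] & d$Product == "X" & d$Date > d$Date[i])
}

# Apply the helper to every row; the result is in the original row order.
df$nSubsqX <- vapply(seq_len(nrow(df)), count_later_x, integer(1), d = df)

# Sort for display, as in the desired output.
df[order(df$Customer, df$Date), ]
```

Because each row is compared against "this" row's own date explicitly, no prior sorting is needed for the counts to be correct.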
How about something like this:
library(data.table)
library(magrittr)  # provides the %>% pipe used below
setDT(df)[order(Customer,Date)] %>%
.[Product=="X", nSubsqX:=.N-1:.N, by=.(Customer)] %>%
.[order(Customer,Date),nSubsqX:=zoo::na.locf(nSubsqX)] %>%
.[]
Output:
Date Customer Product nSubsqX
<Date> <char> <char> <int>
1: 2020-05-18 A X 0
2: 2020-02-10 B X 5
3: 2020-02-12 B Y 5
4: 2020-03-04 B Z 5
5: 2020-03-29 B X 4
6: 2020-04-08 B X 3
7: 2020-04-30 B X 2
8: 2020-05-13 B X 1
9: 2020-05-23 B Y 1
10: 2020-07-02 B Y 1
11: 2020-08-26 B Y 1
12: 2020-12-06 B X 0
13: 2020-01-31 C X 3
14: 2020-09-19 C X 2
15: 2020-10-13 C X 1
16: 2020-11-11 C X 0
17: 2020-12-26 C Y 0
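Explanation:
- Use setDT() to convert df to a data.table
- Order by Customer, then by Date
- In i, restrict to rows where Product == "X"
- In j, create nSubsqX (by group) by assigning the number of rows in the group (i.e. .N) minus the row number within the group (which can be generated on the fly as the sequence 1:.N)
- In by, set the grouping column; in this case, we want to group by Customer
- Use zoo::na.locf to fill the remaining NA values with the last non-NA value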
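The steps above can also be sketched in base R, which may help show what the countdown and the fill each contribute. This is an illustrative sketch only, not the answer's own code: the helpers locf and countdown are made-up names, and locf is a simplified stand-in for zoo::na.locf.

```r
# Sample data from the question.
df <- data.frame(
  Date = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12", "2020-03-04",
                   "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13",
                   "2020-05-18", "2020-05-23", "2020-07-02", "2020-08-26",
                   "2020-09-19", "2020-10-13", "2020-11-11", "2020-12-06",
                   "2020-12-26")),
  Customer = c("C","B","B","B","B","B","B","B","A",
               "B","B","B","C","C","C","B","C"),
  Product  = c("X","X","Y","Z","X","X","X","X","X",
               "Y","Y","Y","X","X","X","X","Y")
)

# Hypothetical helper: carry the last non-NA value forward.
# Leading NAs (before the first non-NA value) stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 maps to the NA placeholder
}

# Hypothetical helper: X rows count down to 0, other rows are filled forward.
countdown <- function(is_x) {
  out <- rep(NA_real_, length(is_x))
  out[is_x] <- rev(seq_len(sum(is_x))) - 1
  locf(out)
}

# Order by Customer, then Date, and apply the countdown within each customer.
df <- df[order(df$Customer, df$Date), ]
df$nSubsqX <- ave(df$Product == "X", df$Customer, FUN = countdown)
df
```

With this helper, any rows that come before a customer's first "X" order would stay NA; that case does not occur in the sample data, where every customer's first order is an "X".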
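Here is an option with tidyverse: arrange by 'Customer', 'Date', then, grouped by 'Customer', replace the elements of a vector of NAs where 'Product' is 'X' with the reversed (rev) sequence of the count of 'X' values, minus 1. Then either tidyr::fill or zoo::na.locf0 can be used to fill the NA elements with the previous non-NA value.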
library(dplyr)
df %>%
arrange(Customer, Date) %>%
group_by(Customer) %>%
mutate(new = zoo::na.locf0(replace(rep(NA_real_, n()),
Product == "X", rev(seq_len(sum(Product == "X")))-1))) %>%
ungroup
-output
# A tibble: 17 × 4
Date Customer Product new
<date> <chr> <chr> <dbl>
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
A similar option can be done with data.table:
library(data.table)
setDT(df)[order(Customer, Date)][Product == "X",
nSubsqx := rev(seq_len(.N)) - 1, Customer][,
nSubsqx := nafill(nSubsqx, "locf"), Customer][]
-output
Index: <Product>
Date Customer Product nSubsqx
<Date> <char> <char> <num>
1: 2020-05-18 A X 0
2: 2020-02-10 B X 5
3: 2020-02-12 B Y 5
4: 2020-03-04 B Z 5
5: 2020-03-29 B X 4
6: 2020-04-08 B X 3
7: 2020-04-30 B X 2
8: 2020-05-13 B X 1
9: 2020-05-23 B Y 1
10: 2020-07-02 B Y 1
11: 2020-08-26 B Y 1
12: 2020-12-06 B X 0
13: 2020-01-31 C X 3
14: 2020-09-19 C X 2
15: 2020-10-13 C X 1
16: 2020-11-11 C X 0
17: 2020-12-26 C Y 0
Here is a dplyr-only solution:
The trick is to subtract the running count of X (i.e. cumsum(Product=="X")) from the total count of X (i.e. sum(Product=="X")) within each Customer group:
library(dplyr)
df %>%
arrange(Customer, Date) %>%
group_by(Customer) %>%
mutate(nSubsqX1 = sum(Product=="X") - cumsum(Product=="X"))
Date Customer Product nSubsqX1
<date> <chr> <chr> <int>
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
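To see why the subtraction works, it can help to run the same arithmetic on a tiny example in base R. The sketch below uses a made-up data frame named orders (not the question's data), assumed to be already sorted by date within each customer, in which customer B's first order is not an "X".

```r
# Hypothetical toy data, already in date order within each customer.
orders <- data.frame(
  Customer = c("B", "B", "B", "B", "C", "C"),
  Product  = c("Y", "X", "Y", "X", "X", "Y")
)

# 1 where the order is an "X", 0 otherwise.
is_x <- as.integer(orders$Product == "X")

# Per customer: total number of X orders minus the running count of X so far.
orders$nSubsqX <- ave(is_x, orders$Customer,
                      FUN = function(v) sum(v) - cumsum(v))
orders
```

For customer B the total is 2 and the running count is 0, 1, 1, 2, giving 2, 1, 1, 0. The first row gets 2 even though it is not an "X" order itself, because no fill step is involved.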