Mutate a variable with a few conditions
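I have a dataset and want to create a column named stock. The rules are as follows.
- If an agent won the game in the previous month, they receive stock equal to that previous month's quantity.
- This stock disappears after three months (for example, if an agent wins on 2020-02-01, that stock is gone by the time they play on 2020-06-01).
- There are many agents (A, B, and so on).
How can I create such a column with a tibble?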
Case 1: Simple version
library(tibble)
date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, 10, 100, 100, 60, 50, 50, 10, 10, 100)
dat <- tibble(date = as.Date(date), id = id, win = win, quantity = quantity)
date id win quantity stock
<date> <chr> <dbl> <dbl> <dbl>
2020-01-01 A 0 60 0
2020-02-01 A 1 50 0
2020-03-01 A 0 50 50 ## holds 50 for 3 months because A won in the previous month
2020-04-01 A 1 100 50
2020-05-01 A 0 10 150 ## gains 100 for 3 months because A won in the previous month
2020-06-01 A 0 10 100 ## the 50 disappears after 3 months
2020-07-01 A 0 100 100 ## the 100 is still held (its third month)
2020-08-01 A 0 100 0 ## the 100 disappears after 3 months
2020-01-01 B 0 60 0
2020-02-01 B 0 50 0
2020-03-01 B 0 50 0
2020-04-01 B 1 10 0
2020-05-01 B 0 10 10
2020-08-01 B 0 100 0
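To make the rule concrete, here is a brute-force sketch in base R that reproduces the expected stock column for the Case 1 data. It reads the rule as "sum the quantity of every earlier win by the same agent whose date plus 3 months has not yet passed". The helper add_months and the names dat1 and expiry are made up for this illustration, and the row-by-row scan is only meant for checking small inputs, not for real data.

```r
# Case 1 data, copied from the question.
dat1 <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
                   "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01",
                   "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
                   "2020-05-01", "2020-08-01")),
  id = rep(c("A", "B"), times = c(8, 6)),
  win = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  quantity = c(60, 50, 50, 100, 10, 10, 100, 100, 60, 50, 50, 10, 10, 100)
)

# Hypothetical helper: add n calendar months to each date using base R only.
# (Fine for first-of-month dates like these; month-end dates need more care.)
add_months <- function(d, n) {
  out <- vapply(seq_along(d), function(k) {
    as.numeric(seq(d[k], by = paste(n, "months"), length.out = 2)[2])
  }, numeric(1))
  as.Date(out, origin = "1970-01-01")
}

# Date on which the stock earned by each row (if it is a win) is last held.
expiry <- add_months(dat1$date, 3)

# For every row, sum the quantities of the same agent's earlier wins
# that are still within their 3-month window.
dat1$stock <- vapply(seq_len(nrow(dat1)), function(k) {
  live <- dat1$id == dat1$id[k] & dat1$win > 0 &
    dat1$date < dat1$date[k] & expiry >= dat1$date[k]
  sum(dat1$quantity[live])
}, numeric(1))

dat1
```

The two date comparisons in live are the same range condition that a join-based solution has to express.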
Case 2: Real data
gameid = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "B1", "B2", "B3", "B4", "B5", "B6")
date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A","A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, NA, NA, 100, 100, 60, 50, 50, 10, 10, 100)
dat = tibble(gameid = gameid, date = as.Date(date), id = id, win = win, quantity = quantity)
> dat
# A tibble: 15 × 5
gameid date id win quantity
<chr> <date> <chr> <dbl> <dbl>
1 A1 2020-01-01 A 0 60
2 A2 2020-02-01 A 1 50
3 A3 2020-03-01 A 0 50
4 A4 2020-04-01 A 1 100
5 A5 2020-05-01 A 0 10
6 A6 2020-06-01 A 0 NA
7 A7 2020-06-01 A 0 NA # Case 1 does not cover this: A plays twice on the same date.
8 A8 2020-07-01 A 1 100
9 A9 2020-08-01 A 0 100
10 B1 2020-01-01 B 0 60
11 B2 2020-02-01 B 0 50
12 B3 2020-03-01 B 0 50
13 B4 2020-04-01 B 1 10
14 B5 2020-05-01 B 0 10
15 B6 2020-08-01 B 0 100
Completely rewritten; see the edit history for the previous (incomplete/incorrect) answer.
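While this looks like a rolling join, the fact that you need to account for gaps means it is really a range-based (non-equi) join. For that, we will use one of the following packages: data.table, fuzzyjoin (dplyr-style), or sqldf. We will also use lubridate for the 3-month date arithmetic.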
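As a small aside on why lubridate's %m+% operator is a reasonable choice for this arithmetic, the sketch below contrasts it with plain + months(3). The dates in this question are all the first of the month, so both forms agree here; the difference only shows up for month-end dates. The variable names are chosen for illustration.

```r
library(lubridate)

first_of_month <- as.Date("2020-02-01")
month_end      <- as.Date("2020-11-30")

# On a first-of-month date the two forms give the same answer.
plus_simple  <- first_of_month + months(3)
plus_rolling <- first_of_month %m+% months(3)

# On a month-end date, + months(3) lands on the nonexistent 2021-02-30 and
# yields NA, while %m+% rolls back to the last valid day of that month.
end_simple  <- month_end + months(3)
end_rolling <- month_end %m+% months(3)

print(c(plus_simple, plus_rolling, end_simple, end_rolling))
```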
data.table
library(lubridate)
library(data.table)
datDT <- as.data.table(dat) # should use setDT(dat) if you're really going this route
datDT[, .(id, fromdate = date, todate = date %m+% months(3),
w2 = win, q2 = quantity)
][datDT, on = .(id, fromdate < date, todate >= date)
][, .(stock = sum(c(0, q2[w2 > 0]))), by = .(gameid, date = fromdate, id, win, quantity)
][ is.na(stock), stock := 0 ][]
# gameid date id win quantity stock
# 1: A1 2020-01-01 A 0 60 0
# 2: A2 2020-02-01 A 1 50 0
# 3: A3 2020-03-01 A 0 50 50
# 4: A4 2020-04-01 A 1 100 50
# 5: A5 2020-05-01 A 0 10 150
# 6: A6 2020-06-01 A 0 NA 100
# 7: A7 2020-06-01 A 0 NA 100
# 8: A8 2020-07-01 A 1 100 100
# 9: A9 2020-08-01 A 0 100 100
# 10: B1 2020-01-01 B 0 60 0
# 11: B2 2020-02-01 B 0 50 0
# 12: B3 2020-03-01 B 0 50 0
# 13: B4 2020-04-01 B 1 10 0
# 14: B5 2020-05-01 B 0 10 10
# 15: B6 2020-08-01 B 0 100 0
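One detail in the call above is easy to miss. To my understanding, in a data.table non-equi join x[i, on = ...] the join columns of the result keep x's names but carry the values from i, which is why the grouping step can write date = fromdate. A toy sketch, where the tables x and i are made up for illustration:

```r
library(data.table)

# Hypothetical one-row "wins" table: a win on 2020-02-01 that expires on 2020-05-01.
x <- data.table(id = "A",
                fromdate = as.Date("2020-02-01"),
                todate   = as.Date("2020-05-01"),
                q2 = 50)

# Hypothetical games to look up: one inside the window, one after it.
i <- data.table(id = "A", date = as.Date(c("2020-03-01", "2020-06-01")))

# Same join condition as in the answer: the win must be strictly earlier
# than the game, and the game must not be later than the expiry date.
res <- x[i, on = .(id, fromdate < date, todate >= date)]
print(res)
```

The second game has no matching win, so its q2 is NA; that is the case the final is.na(stock) step in the answer turns into 0.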
fuzzyjoin
library(dplyr)
# library(fuzzyjoin) # fuzzy_left_join
library(lubridate)
fuzzyjoin::fuzzy_left_join(
dat, transform(dat, todate = date %m+% months(3)),
by = c(id = "id", date = "date", date = "todate"),
match_fun = list(`==`, `>`, `<=`)) %>%
group_by(gameid = gameid.x, date = date.x, id = id.x, win = win.x, quantity = quantity.x) %>%
summarize(stock = sum(quantity.y[win.y > 0]), .groups = "drop") %>%
mutate(stock = coalesce(stock, 0)) %>%
arrange(id, date)
# # A tibble: 15 x 6
# gameid date id win quantity stock
# <chr> <date> <chr> <dbl> <dbl> <dbl>
# 1 A1 2020-01-01 A 0 60 0
# 2 A2 2020-02-01 A 1 50 0
# 3 A3 2020-03-01 A 0 50 50
# 4 A4 2020-04-01 A 1 100 50
# 5 A5 2020-05-01 A 0 10 150
# 6 A6 2020-06-01 A 0 NA 100
# 7 A7 2020-06-01 A 0 NA 100
# 8 A8 2020-07-01 A 1 100 100
# 9 A9 2020-08-01 A 0 100 100
# 10 B1 2020-01-01 B 0 60 0
# 11 B2 2020-02-01 B 0 50 0
# 12 B3 2020-03-01 B 0 50 0
# 13 B4 2020-04-01 B 1 10 0
# 14 B5 2020-05-01 B 0 10 10
# 15 B6 2020-08-01 B 0 100 0
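As a further alternative in the same dplyr style, and not part of the approaches listed above: dplyr 1.1.0 and later support inequality conditions directly through join_by(), so the same range join can be sketched without fuzzyjoin. This assumes dplyr >= 1.1.0 is installed; the intermediate names wins, q2, and out are chosen for illustration. Unlike the pipeline above, it filters to winning rows before joining and uses na.rm = TRUE, so a win with a missing quantity would count as 0 here.

```r
library(dplyr)
library(lubridate)

# Case 2 data, copied from the question.
dat <- tibble(
  gameid = c(paste0("A", 1:9), paste0("B", 1:6)),
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
                   "2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01",
                   "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01",
                   "2020-04-01", "2020-05-01", "2020-08-01")),
  id = rep(c("A", "B"), times = c(9, 6)),
  win = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0),
  quantity = c(60, 50, 50, 100, 10, NA, NA, 100, 100, 60, 50, 50, 10, 10, 100)
)

# One row per win, with the window during which its stock is held.
wins <- dat %>%
  filter(win > 0) %>%
  transmute(id, fromdate = date, todate = date %m+% months(3), q2 = quantity)

# Match each game to the same agent's wins that are strictly earlier
# and not yet expired, then add up the matched quantities.
out <- dat %>%
  left_join(wins, by = join_by(id, date > fromdate, date <= todate)) %>%
  group_by(gameid, date, id, win, quantity) %>%
  summarize(stock = sum(q2, na.rm = TRUE), .groups = "drop") %>%
  arrange(id, date)

print(out)
```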
sqldf
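This uses the SQLite backend, but most (all?) of sqldf's other backends should work with few changes. The query refers to a todate column, so we add it to the original frame first.
library(lubridate)
dat$todate <- dat$date %m+% months(3) # last date on which stock from this game is held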
sqldf::sqldf("
select t1.gameid, t1.date, t1.id, t1.win, t1.quantity,
sum(case when t2.win > 0 then t2.quantity else 0 end) as stock
from dat t1
left join dat t2 on t1.id = t2.id
and t2.todate between t1.date and (t1.todate - 1)
group by t1.gameid, t1.date, t1.id, t1.win, t1.quantity
order by t1.id, t1.date")
# gameid date id win quantity stock
# 1 A1 2020-01-01 A 0 60 0
# 2 A2 2020-02-01 A 1 50 0
# 3 A3 2020-03-01 A 0 50 50
# 4 A4 2020-04-01 A 1 100 50
# 5 A5 2020-05-01 A 0 10 150
# 6 A6 2020-06-01 A 0 NA 100
# 7 A7 2020-06-01 A 0 NA 100
# 8 A8 2020-07-01 A 1 100 100
# 9 A9 2020-08-01 A 0 100 100
# 10 B1 2020-01-01 B 0 60 0
# 11 B2 2020-02-01 B 0 50 0
# 12 B3 2020-03-01 B 0 50 0
# 13 B4 2020-04-01 B 1 10 0
# 14 B5 2020-05-01 B 0 10 10
# 15 B6 2020-08-01 B 0 100 0