Remove Unique Row with Looping Criteria using dplyr in R
Here is my data:
## Data
datex <- c(rep("2021-01-18", 61), rep("2021-01-19", 125))
hourx <- c(0,1,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,16,10,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,11,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,15,15,15,15,16,16,16,16)
transaction <- c(1,6,2,5,1,2,1,9,6,12,5,25,14,6,22,9,10,14,15,12,22,12,12,14,9,11,3,3,4,0,1,4,3,1,2,3,3,5,7,5,5,6,9,16,8,13,10,20,15,18,10,19,15,5,13,12,10,12,26,14,0,4,0,0,0,2,0,0,2,0,4,0,6,8,0,2,3,0,2,0,1,0,1,0,2,0,0,2,1,1,0,0,3,0,1,0,3,0,0,6,5,2,0,8,0,0,12,11,0,2,0,11,0,0,14,21,0,0,13,7,0,17,0,0,18,0,7,0,4,4,0,0,7,12,0,13,0,0,13,6,9,0,0,0,16,0,0,16,0,14,0,0,9,0,11,8,0,8,0,0,8,0,10,5,0,15,0,0,3,0,0,8,8,0,0,6,5,0,8,0,0,5,1,0,0,3)
# Note: seller, product, detail, status and channel are not defined in this excerpt
mydata <- data.frame(datex, hourx, seller, product, detail, status, channel, transaction)
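My task is to add 0 to the combinations. This is what I mean: here is an example from which I want to find the change points.
From the result you can see that datex "2021-01-18" and "2021-01-19" are missing hourx 17 to 23, so we need to add 0 for hourx 17-23. I did this manually.
How can I use dplyr to automatically add 0 to the missing "hourx" for all combinations?
Thank you very much.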
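For reference, a common way to pad missing hours is `tidyr::complete` (tidyr rather than dplyr itself). The sketch below is only an illustration under stated assumptions, not the asker's code: `toy` is a made-up three-row table, `padded` is a hypothetical name, and hours are assumed to run from 0 to 23.

```r
library(dplyr)
library(tidyr)

# Made-up toy data: two dates, only a few hours present
toy <- tibble(
  datex       = c("2021-01-18", "2021-01-18", "2021-01-19"),
  hourx       = c(0, 1, 0),
  transaction = c(1, 6, 4)
)

# Expand to every datex x hour (0-23) combination;
# rows that did not exist get transaction = 0
padded <- toy %>%
  complete(datex, hourx = as.numeric(0:23), fill = list(transaction = 0))

print(padded, n = 48)
```

With the full data, the other grouping columns (seller, product and so on) would have to be listed in `complete()` as well, possibly wrapped in `tidyr::nesting()` so that only combinations that actually occur are expanded.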
I think you are looking for something like this.
mydata <-
tibble::tribble(
~datex, ~hourx, ~channel, ~transaction,
"05-Mar-21", 1, "dombsdpapp1", 50,
"05-Mar-21", 7, "dombsdpapp1", 100,
"05-Mar-21", 9, "dombsdpapp1", 20,
"05-Mar-21", 5, "dombsdpapp2", 100,
"05-Mar-21", 5, "dombsdpapp3", 75,
"05-Mar-21", 9, "dombsdpapp3", 95,
"05-Mar-21", 10, "dombsdpapp3", 35,
"05-Mar-21", 11, "dombsdpapp3", 60,
"06-Mar-21", 1, "dombsdpapp1", 55,
"06-Mar-21", 13, "dombsdpapp3", 10
)
library(dplyr)
mydata %>%
group_by(channel, hourx) %>%
filter(n() > 1)
#> # A tibble: 2 x 4
#> # Groups: channel, hourx [1]
#> datex hourx channel transaction
#> <chr> <dbl> <chr> <dbl>
#> 1 05-Mar-21 1 dombsdpapp1 50
#> 2 06-Mar-21 1 dombsdpapp1 55
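In this case I suggest a rolling-computation approach, using the runner library.
- First, checking your first condition is fairly easy. Use group_by and create a logical variable, say d1, that records whether the row's datex and channel group contains more than one row, i.e. whether the row is not unique.
- The second condition is trickier. For each date, we collect all hourx and channel combinations from the previous day into a list variable, say d2.
- Finally, we turn this list dummy variable into a logical one that checks whether the current row's combination exists on the previous day. Here I use purrr::map2 to mutate the list variable.
- The rest is just a filter, which is easy to follow because both dummy variables are logical.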
library(purrr)
library(dplyr)
library(runner)
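mydata %>%
  mutate(datex = as.Date(datex, "%d-%B-%y")) %>%
  # d1: TRUE when the datex/channel group has more than one row
  group_by(datex, channel) %>%
  mutate(d1 = n() > 1) %>%
  ungroup() %>%
  # d2: first the previous day's hourx/channel keys (a list),
  # then whether this row's key is among them
  mutate(d2 = runner(x = paste(hourx, channel),
                     idx = datex,
                     k = '1 day',
                     lag = 1,
                     f = function(x) list(x)),
         d2 = unlist(map2(paste(hourx, channel), d2, ~ .x %in% .y))) %>%
  # keep only the rows for which both flags are FALSE
  filter(!d1 & !d2)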
# A tibble: 2 x 6
datex hourx channel transaction d1 d2
<date> <dbl> <chr> <dbl> <lgl> <lgl>
1 2021-03-05 5 dombsdpapp2 100 FALSE FALSE
2 2021-03-06 13 dombsdpapp3 10 FALSE FALSE
Or, if you want to keep the rows that meet either condition:
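mydata %>%
  mutate(datex = as.Date(datex, "%d-%B-%y")) %>%
  # d1: TRUE when the datex/channel group has more than one row
  group_by(datex, channel) %>%
  mutate(d1 = n() > 1) %>%
  ungroup() %>%
  # d2: first the previous day's hourx/channel keys (a list),
  # then whether this row's key is among them
  mutate(d2 = runner(x = paste(hourx, channel),
                     idx = datex,
                     k = '1 day',
                     lag = 1,
                     f = function(x) list(x)),
         d2 = unlist(map2(paste(hourx, channel), d2, ~ .x %in% .y))) %>%
  # keep the rows for which at least one flag is TRUE
  filter(d1 | d2)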
# A tibble: 8 x 6
datex hourx channel transaction d1 d2
<date> <dbl> <chr> <dbl> <lgl> <lgl>
1 2021-03-05 1 dombsdpapp1 50 TRUE FALSE
2 2021-03-05 7 dombsdpapp1 100 TRUE FALSE
3 2021-03-05 9 dombsdpapp1 20 TRUE FALSE
4 2021-03-05 5 dombsdpapp3 75 TRUE FALSE
5 2021-03-05 9 dombsdpapp3 95 TRUE FALSE
6 2021-03-05 10 dombsdpapp3 35 TRUE FALSE
7 2021-03-05 11 dombsdpapp3 60 TRUE FALSE
8 2021-03-06 1 dombsdpapp1 55 FALSE TRUE
- Needless to say, you can drop the dummy variables d1 and d2 once they are no longer needed.
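If runner is not available, the same two flags can probably be reproduced with dplyr joins alone. The following is only a sketch under my own assumptions: the sample dates are rewritten in ISO format so that parsing does not depend on the locale, and the names `seen_yesterday`, `n_in_group`, `flagged` and `unique_rows` are made up for illustration.

```r
library(dplyr)

# Same sample rows as above, with dates rewritten as ISO strings (assumption)
mydata <- tibble::tribble(
  ~datex,       ~hourx, ~channel,      ~transaction,
  "2021-03-05",  1,     "dombsdpapp1",  50,
  "2021-03-05",  7,     "dombsdpapp1", 100,
  "2021-03-05",  9,     "dombsdpapp1",  20,
  "2021-03-05",  5,     "dombsdpapp2", 100,
  "2021-03-05",  5,     "dombsdpapp3",  75,
  "2021-03-05",  9,     "dombsdpapp3",  95,
  "2021-03-05", 10,     "dombsdpapp3",  35,
  "2021-03-05", 11,     "dombsdpapp3",  60,
  "2021-03-06",  1,     "dombsdpapp1",  55,
  "2021-03-06", 13,     "dombsdpapp3",  10
) %>%
  mutate(datex = as.Date(datex))

# hourx/channel keys of each day, shifted forward by one day so that
# they line up with the rows of the following day
seen_yesterday <- mydata %>%
  transmute(datex = datex + 1, hourx, channel) %>%
  distinct() %>%
  mutate(d2 = TRUE)

flagged <- mydata %>%
  # d1: TRUE when the datex/channel group has more than one row
  add_count(datex, channel, name = "n_in_group") %>%
  mutate(d1 = n_in_group > 1) %>%
  # d2: TRUE when the same hourx/channel key occurred on the previous day
  left_join(seen_yesterday, by = c("datex", "hourx", "channel")) %>%
  mutate(d2 = coalesce(d2, FALSE)) %>%
  select(-n_in_group)

# Rows for which both flags are FALSE
unique_rows <- flagged %>% filter(!d1 & !d2)
print(unique_rows)
```

On this sample the filter leaves the same two rows as the runner version (dombsdpapp2 at hour 5 and dombsdpapp3 at hour 13). The join only looks at exactly one calendar day back, which is my reading of the `k = '1 day', lag = 1` window.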