Group by a unique value and find duration, while satisfying multiple conditions in R (dplyr)
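I have a dataset (sample below).
Ultimately, I would like to group the data into 'chunks' in which the Subject column contains a unique value, the Folder column is 'Outdata', and the Message column is blank. I am trying to find the duration for each unique Subject, with the rows filtered to Folder == "Outdata" and Message == "".
Here is the data: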
Folder    DATE                 Message   Subject
Outdata   9/9/2019 5:46:00               Hi
Outdata   9/9/2019 5:46:01               Hi
Outdata   9/9/2019 5:46:02               Hi
Outdata   9/9/2019 5:46:03     hello     Hi
Outdata   9/9/2019 5:46:04     hello     OK
Outdata   9/10/2019 6:00:01              OK
Outdata   9/10/2019 6:00:02              Sure
In        9/11/2019 7:50:00    hello     Sure
In        9/11/2019 7:50:01    hello
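I would like the code to do essentially this: filter Folder to Outdata, filter Message to "", and group by unique Subject, so that the duration is computed only where the preceding conditions hold.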
Folder    DATE                 Message   Subject   Duration
Outdata   9/9/2019 5:46:00               Hi
Outdata   9/9/2019 5:46:01               Hi
Outdata   9/9/2019 5:46:02               Hi        2 sec
Outdata   9/10/2019 6:00:01              OK        1 sec
Outdata   9/10/2019 6:00:02              Sure      1 sec
The duration of a unique Subject should be computed only when Message is blank and Folder is Outdata, so the output would look like this:
gr Duration
Outdata1 2 sec
Outdata2 1 sec
Outdata3 1 sec
I have included the dput output:
structure(list(Folder = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L), .Label = c("In", "Outdata"), class = "factor"), Date = structure(c(5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L), .Label = c("9/10/2019 6:00:01 AM",
"9/10/2019 6:00:02 AM", "9/11/2019 7:50:00 AM", "9/11/2019 7:50:01 AM",
"9/9/2019 5:46:00 AM", "9/9/2019 5:46:01 AM", "9/9/2019 5:46:02 AM",
"9/9/2019 5:46:03 AM", "9/9/2019 5:46:04 AM"), class = "factor"),
Message = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("",
"hello"), class = "factor"), Subject = structure(c(2L, 2L,
2L, 2L, 3L, 3L, 4L, 4L, 1L), .Label = c("", "Hi", "OK", "Sure"
), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
This is what I have tried. It works well; I just need to also take into account that the Message value is blank.
library(dplyr)
filterdf<-df[!(df$Message == ""),]
filterdf %>%
group_by(Subject) %>%
mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %I:%M:%S %p"),
gr = cumsum(Folder != lag(Folder, default = TRUE))) %>%
filter(Folder == "Outdata") %>%
arrange(gr, DATE) %>%
group_by(gr) %>%
summarise(Duration = difftime(last(DATE), first(DATE), units = "secs")) %>%
mutate(gr = paste0('Out', row_number()))
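I am not sure how to group by unique Subject value and find its duration while also satisfying the Message == "" and Folder == "Outdata" conditions.
Any help is appreciated.
Thanks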
UPDATE:
I am getting output in which the Duration values are all the same. Here is the dput output of my larger sample set:
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "b"), class = "factor"),
Folder = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Outlookdata", class = "factor"),
Message = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "hello"), class = "factor"),
Date = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L,
22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L,
34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L,
58L, 59L), .Label = c("9/9/2019 5:46:38 PM", "9/9/2019 5:46:40 PM",
"9/9/2019 5:46:42 PM", "9/9/2019 5:46:43 PM", "9/9/2019 5:46:44 PM",
"9/9/2019 5:46:45 PM", "9/9/2019 5:46:46 PM", "9/9/2019 5:46:47 PM",
"9/9/2019 5:46:49 PM", "9/9/2019 5:46:50 PM", "9/9/2019 5:46:51 PM",
"9/9/2019 5:46:52 PM", "9/9/2019 5:46:53 PM", "9/9/2019 5:46:54 PM",
"9/9/2019 5:46:55 PM", "9/9/2019 5:46:56 PM", "9/9/2019 5:46:58 PM",
"9/9/2019 5:46:59 PM", "9/9/2019 5:47:00 PM", "9/9/2019 5:47:01 PM",
"9/9/2019 5:48:27 PM", "9/9/2019 5:48:30 PM", "9/9/2019 5:48:31 PM",
"9/9/2019 5:48:32 PM", "9/9/2019 5:48:33 PM", "9/9/2019 5:48:34 PM",
"9/9/2019 5:48:35 PM", "9/9/2019 5:48:37 PM", "9/9/2019 5:48:38 PM",
"9/9/2019 5:48:39 PM", "9/9/2019 5:48:40 PM", "9/9/2019 5:48:41 PM",
"9/9/2019 5:48:43 PM", "9/9/2019 5:48:44 PM", "9/9/2019 5:48:45 PM",
"9/9/2019 5:48:46 PM", "9/9/2019 5:48:47 PM", "9/9/2019 5:48:48 PM",
"9/9/2019 5:48:50 PM", "9/9/2019 5:48:51 PM", "9/9/2019 5:48:52 PM",
"9/9/2019 5:48:53 PM", "9/9/2019 5:48:54 PM", "9/9/2019 5:48:55 PM",
"9/9/2019 5:48:56 PM", "9/9/2019 5:48:58 PM", "9/9/2019 5:48:59 PM",
"9/9/2019 5:49:00 PM", "9/9/2019 5:49:01 PM", "9/9/2019 5:49:02 PM",
"9/9/2019 5:49:03 PM", "9/9/2019 5:49:04 PM", "9/9/2019 5:49:05 PM",
"9/9/2019 5:49:06 PM", "9/9/2019 5:49:07 PM", "9/9/2019 5:49:08 PM",
"9/9/2019 5:49:09 PM", "9/9/2019 5:49:10 PM", "9/9/2019 5:49:11 PM"
), class = "factor")), class = "data.frame", row.names = c(NA,
-60L))
If we include the 'Subject' column, there will be 3 rows, because there are 3 unique values after we extract 'Outdata' from 'Folder':
library(dplyr)
library(stringr)
library(lubridate)
library(data.table)
df %>%
filter(Folder == 'Outdata') %>% #filter only Outdata rows
mutate(Date = mdy_hms(Date)) %>% # convert to Datetime class
group_by(grp = rleid(Message)) %>% # create a group based on similarity of adjacent elements
filter(all(Message == '')) %>% # keep only the groups where all values in Message are blank
transmute(Subject, Duration = diff(range(Date))) %>% # get the difference of range of dates
ungroup %>%
distinct %>% # get the distinct rows
mutate(grp = str_c("Outdata", row_number())) # update by pasting 'Outdata'
# A tibble: 3 x 3
# grp Subject Duration
# <chr> <fct> <drtn>
#1 Outdata1 Hi 2 secs
#2 Outdata2 OK 1 secs
#3 Outdata3 Sure 1 secs
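The grouping step above relies on `data.table::rleid`, which assigns one id per run of identical adjacent values. A minimal sketch of what it does to the Message column of the seven 'Outdata' rows (the vector `msg` is typed in by hand here for illustration):

```r
library(data.table)

# Message values of the seven 'Outdata' rows, in row order
msg <- c("", "", "", "hello", "hello", "", "")

# One id per run of identical adjacent values
run_id <- rleid(msg)
print(run_id)
# 1 1 1 2 2 3 3

# For each run, check whether every Message is blank;
# this mirrors filter(all(Message == '')) applied per group
all_blank <- tapply(msg == "", run_id, all)
print(all_blank)
```

Runs 1 and 3 are entirely blank, so those are the groups that survive the filter; run 2 (the "hello" rows) is dropped.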
Without 'Subject', it would be 2 rows:
df %>%
filter(Folder == 'Outdata') %>%
mutate(Date = mdy_hms(Date)) %>%
group_by(grp = rleid(Message)) %>%
filter(all(Message == '')) %>%
summarise(Duration = diff(range(Date))) %>%
mutate(grp = str_c("Outdata", row_number()))
# A tibble: 2 x 2
# grp Duration
# <chr> <drtn>
#1 Outdata1 2 secs
#2 Outdata2 1 secs
Update
With the new dataset:
df1 %>%
filter(Folder == 'Outlookdata') %>%
mutate(Date = mdy_hms(Date)) %>%
group_by(grp = rleid(Message)) %>%
filter(all(Message == "")) %>%
transmute(Subject, Duration = diff(range(Date))) %>%
ungroup %>%
distinct
# A tibble: 3 x 3
# grp Subject Duration
# <int> <fct> <drtn>
#1 1 A 17 secs
#2 3 A 132 secs
#3 3 b 132 secs
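In the output above, both subjects in group 3 share the same 132 secs because the duration is computed per run of Message, not per Subject. If a separate duration per Subject is wanted, one possible variation is to start a new run whenever either Message or Subject changes, by passing both columns to `rleid`. The sketch below uses a small hand-made data frame (`toy`, hypothetical, not the asker's `df1`) and `as.POSIXct` instead of `mdy_hms` to keep dependencies down; it assumes an English/C locale for the AM/PM marker.

```r
library(dplyr)
library(data.table)

# Hypothetical toy data in the same shape as the question's data
toy <- data.frame(
  Folder  = "Outlookdata",
  Subject = c("A", "A", "A", "A", "b", "b", "b"),
  Message = c("", "", "hello", "", "", "", ""),
  Date    = c("9/9/2019 5:46:38 PM", "9/9/2019 5:46:40 PM",
              "9/9/2019 5:46:42 PM", "9/9/2019 5:46:43 PM",
              "9/9/2019 5:48:27 PM", "9/9/2019 5:48:30 PM",
              "9/9/2019 5:48:31 PM"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  filter(Folder == "Outlookdata") %>%
  mutate(Date = as.POSIXct(Date, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")) %>%
  group_by(grp = rleid(Message, Subject)) %>%  # new run when Message OR Subject changes
  filter(all(Message == "")) %>%               # keep only the all-blank runs
  group_by(grp, Subject) %>%
  summarise(Duration = difftime(max(Date), min(Date), units = "secs"),
            .groups = "drop")

print(res)
```

With this grouping, a blank run that spans two subjects is split at the point where Subject changes, so each subject gets its own duration instead of the shared range.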