Count number of sequences of a specific value per group
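Say I have a data frame with an ID and a variable whose response is either ON or OFF.
I want to count the number of runs of "ON" per group. I almost got there, but realised my solution cannot handle the first or last value in a group, depending on whether I use lead or lag.
I have searched SO and can find similar questions, but none seem to quite match this one.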
id <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b","c", "c","c","c","c","c","c","c" )
category <- c("ON", "OFF", "OFF", "ON", "ON", "ON", "OFF", "OFF", "ON", "ON", "OFF", "OFF","OFF","OFF","OFF", "ON", "ON","ON")
dat<-data.frame(id, category)
My attempt so far has not worked, I think because it fails if a run starts the group with "ON":
library(dplyr)
summary(dat %>%
  group_by(id) %>%
  filter(category == "ON", lead(category != "ON")) %>%
  count(category) %>%
  arrange(n))
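Any help is much appreciated. My actual dataset has 40,000 rows and 120 IDs, and within each ID the category may start with either ON or OFF.
The output would be something like this: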
#   id    category       n
#  a:1    OFF:0    Min.   :1
#  b:1    ON :2    1st Qu.:1
#  c:0             Median :1
#                  Mean   :1
#                  3rd Qu.:1
#                  Max.   :1
So the interpretation is that 2 ids have a run of "ON" at some point, and the median number of ON runs (in this small sample) is 1.
In base R, we can use tapply with rle:
tapply(dat$category, dat$id, function(x) with(rle(as.character(x)),sum(values == "ON")))
a b c
2 2 1
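To see why this works, it may help to look at what rle returns for a single group. The sketch below uses the categories of id "a" copied from the question; the variable names x and r are only illustrative.

```r
# Categories for id "a", copied from the question's data
x <- c("ON", "OFF", "OFF", "ON", "ON")

# rle() collapses consecutive repeats into one entry per run
r <- rle(x)

r$values   # the value of each run: "ON" "OFF" "ON"
r$lengths  # how many rows each run spans: 1 2 2

# Counting runs of "ON" means counting entries in values, not rows
n_on_runs <- sum(r$values == "ON")
n_on_runs  # 2
```

Because each run appears exactly once in values, it does not matter whether the group starts or ends with "ON".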
# step 1
library(tidyverse)  # dplyr, tidyr, purrr, tibble and forcats are all used below
out <- dat %>%
  group_by(id) %>%
  nest()
# outcome step 1
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    data
#   <chr> <list>
# 1 a     <tibble [5 x 1]>
# 2 b     <tibble [5 x 1]>
# 3 c     <tibble [8 x 1]>
# step 2
out <- out %>%
  mutate(run = map(data, ~ {
    # run-length encode this group's categories
    out_map <- rle(.x$category)
    # one row per run: its length and its value
    tibble(length = out_map$lengths, category = out_map$values)
  })) %>%
  select(-data)
# outcome step 2
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    run
#   <chr> <list>
# 1 a     <tibble [3 x 2]>
# 2 b     <tibble [3 x 2]>
# 3 c     <tibble [2 x 2]>
# step 3
out <- out %>%
  unnest(cols = c(run)) %>%
  # keep only ON runs; length > 1 sets the minimum run length
  # (use length >= 1 to count every run, including single-row ones)
  filter(category == "ON", length > 1) %>%
  ungroup() %>%
  mutate_if(is.character, as_factor)
out
# # A tibble: 3 x 3
#   id    length category
#   <fct>  <int> <fct>
# 1 a          2 ON
# 2 b          2 ON
# 3 c          3 ON
count(out, id, category, sort = TRUE)
# # A tibble: 3 x 3
#   id    category     n
#   <fct> <fct>    <int>
# 1 a     ON           1
# 2 b     ON           1
# 3 c     ON           1
summary(out)
#  id        length      category
#  a:1   Min.   :2.000   ON:3
#  b:1   1st Qu.:2.000
#  c:1   Median :2.000
#        Mean   :2.333
#        3rd Qu.:2.500
#        Max.   :3.000
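For completeness, a lag-based version can also be made to work. In the question's attempt, lead() returns NA on the last row of each group and filter() drops rows where the condition is NA, so a run that ends a group is lost. The sketch below is one possible way around that, not the method from either answer above: it counts run starts instead, and passes default = "OFF" to lag() so that the first row of a group counts as a start when it is "ON". The names runs and n_on_runs are illustrative, and stringsAsFactors = FALSE is set explicitly on the assumption that category should be a character column.

```r
library(dplyr)

# Same data as in the question, kept as character columns
id <- c(rep("a", 5), rep("b", 5), rep("c", 8))
category <- c("ON", "OFF", "OFF", "ON", "ON",
              "ON", "OFF", "OFF", "ON", "ON",
              "OFF", "OFF", "OFF", "OFF", "OFF", "ON", "ON", "ON")
dat <- data.frame(id, category, stringsAsFactors = FALSE)

runs <- dat %>%
  group_by(id) %>%
  summarise(
    # a run starts where the row is "ON" and the previous row is not;
    # default = "OFF" treats the row before the group as "OFF"
    n_on_runs = sum(category == "ON" & lag(category, default = "OFF") != "ON")
  )

runs
```

On this sample it gives 2, 2 and 1 runs for a, b and c, matching the base R result. Unlike step 3 above, it counts single-row runs as well.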