Count number of sequences of a specific value per group

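Suppose I have a data frame containing an ID and a variable whose responses are ON or OFF. I want to count the number of runs of "ON" per group. I almost got there, but realised that my solution cannot handle either the first or the last value in a group, depending on whether I use lead or lag.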

I have searched SO and can find similar questions, but none seems to match this exactly.

id <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b","c", "c","c","c","c","c","c","c" )
category <- c("ON", "OFF", "OFF", "ON", "ON", "ON", "OFF", "OFF", "ON", "ON", "OFF", "OFF","OFF","OFF","OFF", "ON", "ON","ON")
dat<-data.frame(id, category)

My attempt so far has not worked, I think because it fails whenever a run starts with "ON" in a group:

summary(dat %>% group_by(id)%>% filter(category == "ON", lead(category!="ON"))%>% count(category) %>% arrange(n)) 

Any help is much appreciated. My actual dataset has 40,000 rows and 120 IDs, and within each ID the category may start with either ON or OFF.

The output would look something like this:

# id    category       n    
# a:1   OFF:0    Min.   :1  
# b:1   ON :2    1st Qu.:1  
# c:0            Median :1  
#                Mean   :1  
#                3rd Qu.:1  
#                Max.   :1 

So the interpretation is that 2 IDs have a run of "ON" at some point, and the median number of "ON" runs (in this small sample) is 1.

In base R, we can use

tapply(dat$category, dat$id, function(x) with(rle(as.character(x)),sum(values == "ON")))

a b c 
2 2 1 
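
To see why this works, it may help to look at what `rle()` returns for a single group. The sketch below uses the values of group "a" from the question; the variable names `x` and `r` are only illustrative.

```r
# Values of group "a" from the question
x <- c("ON", "OFF", "OFF", "ON", "ON")

# rle() collapses consecutive equal values into runs
r <- rle(x)

r$lengths  # 1 2 2
r$values   # "ON" "OFF" "ON"

# Each element of `values` is one run, so counting the "ON" entries
# counts the "ON" runs, wherever they sit in the group
sum(r$values == "ON")  # 2
```

Because a run at the start or end of a group is just another entry in `values`, this approach needs no special handling for the first or last row.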
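Alternatively, with the tidyverse, nest each id and apply rle() to each group:

# step 1
library(tidyverse)  # provides dplyr, tidyr, purrr and forcats (as_factor)

out <- dat %>%
  group_by(id) %>%
  nest()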

# outcome step 1
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    data            
#   <chr> <list>          
# 1 a     <tibble [5 x 1]>
# 2 b     <tibble [5 x 1]>
# 3 c     <tibble [8 x 1]>

# step 2
out <- out %>%
  mutate(run = map(data, ~ {
    out_map <- rle(.x$category)
    out_map <- tibble(length = out_map[[1]], category = out_map[[2]])
    return(out_map)
  })) %>%
  select(-data)

# outcome step 2
out
# # A tibble: 3 x 2
# # Groups:   id [3]
#   id    run             
#   <chr> <list>          
# 1 a     <tibble [3 x 2]>
# 2 b     <tibble [3 x 2]>
# 3 c     <tibble [2 x 2]>

# step 3
out <- out %>%
  unnest(cols = c(run)) %>%
  # keep only "ON" runs; `length > 1` also sets the minimum run length,
  # so single-row runs are dropped (use `length >= 1` to count every run)
  filter(category == "ON", length > 1) %>%
  ungroup() %>%
  mutate_if(is.character, as_factor)
    
out
# # A tibble: 3 x 3
#   id    length category
#   <fct>  <int> <fct>   
# 1 a          2 ON      
# 2 b          2 ON      
# 3 c          3 ON      

count(out, id, category, sort = TRUE)
# # A tibble: 3 x 3
#   id    category     n
#   <fct> <fct>    <int>
# 1 a     ON           1
# 2 b     ON           1
# 3 c     ON           1

summary(out)
#  id        length      category
#  a:1   Min.   :2.000   ON:3    
#  b:1   1st Qu.:2.000           
#  c:1   Median :2.000           
#        Mean   :2.333           
#        3rd Qu.:2.500           
#        Max.   :3.000
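
For comparison with the lead/lag attempt in the question, here is a minimal sketch that stays within dplyr. It assumes a run of "ON" starts wherever the current row is "ON" and the previous row in the same group is not; passing `default = "OFF"` to `lag()` makes the first row of each group count as a start when it is "ON". The names `on_runs` and `n_on_runs` are made up for illustration.

```r
library(dplyr)

# Example data from the question
id <- c(rep("a", 5), rep("b", 5), rep("c", 8))
category <- c("ON", "OFF", "OFF", "ON", "ON",
              "ON", "OFF", "OFF", "ON", "ON",
              "OFF", "OFF", "OFF", "OFF", "OFF", "ON", "ON", "ON")
dat <- data.frame(id, category, stringsAsFactors = FALSE)

on_runs <- dat %>%
  group_by(id) %>%
  # a run starts where the row is "ON" and the previous row is not;
  # default = "OFF" covers the first row of each group
  summarise(
    n_on_runs = sum(category == "ON" &
                      lag(category, default = "OFF") != "ON"),
    .groups = "drop"
  )

on_runs$n_on_runs  # 2 2 1 for ids a, b, c
```

This gives the same per-id counts as the `rle()` approach and, unlike the pipeline above, also counts runs of length 1. `summary(on_runs$n_on_runs)` then summarises the number of "ON" runs across ids.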