Conditional Sequence Count, Paste Specific Columns

Question

I have an R data frame that looks like this:

id | seq_check | action | ct
123 | end | action_a | 1  
123 | start | action_b | 4  
123 | start | action_c | 1  
456 | end | action_d | 1  
456 | start | action_e | 16  
456 | start | action_f | 4  
456 | start | action_g | 5  
456 | start | action_h | 2  
456 | start | action_i | 1

The 'end' marker appears only once per id and is the endpoint of a particular sequence for that id. What I want is a data frame that looks like this:

id | seq_action | ct  
123 | action_a <- action_b | 4  
123 | action_a <- action_c | 1  
456 | action_d <- action_e | 16  
456 | action_d <- action_f | 4  
456 | action_d <- action_g | 5  
456 | action_d <- action_h | 2  
456 | action_d <- action_i | 1

Does anyone know how I can do this in R? Many thanks!

Answer 1

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'id', we paste the 'action' where 'seq_check' is 'end' together with the 'action' values where 'seq_check' is 'start', and subset 'ct' to the rows where 'seq_check' is 'start'.

library(data.table)
setDT(df1)[,.(seq_action=paste(action[seq_check=="end"],action[seq_check=="start"],
              sep=" <- "), ct = ct[seq_check=="start"]) , by =  id]
#    id           seq_action ct
#1: 123 action_a <- action_b  4
#2: 123 action_a <- action_c  1
#3: 456 action_d <- action_e 16
#4: 456 action_d <- action_f  4
#5: 456 action_d <- action_g  5
#6: 456 action_d <- action_h  2
#7: 456 action_d <- action_i  1

Note: only one package is used.
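For anyone who wants to try the idea without installing a package, the same logic can be sketched in base R. This is only an illustration: the construction of df1 is an assumption reconstructed from the table in the question, and the helper names ends, starts, end_action and result are made up for this sketch.

```r
# Rebuild the example data from the question (assumed layout of df1).
df1 <- data.frame(
  id = c(rep(123, 3), rep(456, 6)),
  seq_check = c("end", "start", "start", "end", rep("start", 5)),
  action = paste0("action_", letters[1:9]),
  ct = c(1, 4, 1, 1, 16, 4, 5, 2, 1),
  stringsAsFactors = FALSE
)

# Separate the single 'end' row per id from the 'start' rows.
ends   <- df1[df1$seq_check == "end", c("id", "action")]
starts <- df1[df1$seq_check == "start", c("id", "action", "ct")]

# Look up the 'end' action for each 'start' row by id.
# match() returns the first hit, which is enough because
# 'end' occurs exactly once per id.
end_action <- ends$action[match(starts$id, ends$id)]

result <- data.frame(
  id = starts$id,
  seq_action = paste(end_action, starts$action, sep = " <- "),
  ct = starts$ct,
  stringsAsFactors = FALSE
)
result
```

The rows stay in their original order because the 'start' rows are only filtered, never reshaped or sorted.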

Or use na.locf and dcast:

library(zoo)
dcast(setDT(df1), id+ct~seq_check, value.var = "action")[, .(id, 
              seq_action=paste(na.locf(end), start, sep=" <- "), ct)]

Answer 2

You can also use dplyr and tidyr:

library(dplyr); library(tidyr);

spread(df, seq_check, action) %>% fill(end) %>% 
      mutate(seq_action = paste(end, start, sep = " <- ")) %>% 
      select(id, seq_action, ct)

   id           seq_action ct
1 123 action_a <- action_c  1
2 123 action_a <- action_b  4
3 456 action_d <- action_i  1
4 456 action_d <- action_h  2
5 456 action_d <- action_f  4
6 456 action_d <- action_g  5
7 456 action_d <- action_e 16
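Note that the result above comes out ordered by id and ct rather than in the original row order. If the original order matters, a grouped variant that avoids reshaping might look like the sketch below. It assumes dplyr is installed and that df has the layout shown in the question; the intermediate column end_action and the name res are hypothetical, introduced only for this example.

```r
library(dplyr)

# Example data, reconstructed from the table in the question (assumed layout).
df <- data.frame(
  id = c(rep(123, 3), rep(456, 6)),
  seq_check = c("end", "start", "start", "end", rep("start", 5)),
  action = paste0("action_", letters[1:9]),
  ct = c(1, 4, 1, 1, 16, 4, 5, 2, 1),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(id) %>%
  # The single 'end' action of each id is recycled across the group.
  mutate(end_action = action[seq_check == "end"]) %>%
  ungroup() %>%
  # Keep only the 'start' rows, in their original order.
  filter(seq_check == "start") %>%
  mutate(seq_action = paste(end_action, action, sep = " <- ")) %>%
  select(id, seq_action, ct)

res
```

This relies on each id having exactly one 'end' row, as stated in the question; with zero or several 'end' rows in a group the mutate step would fail.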
