Reformat data based on pattern in R

I hope you can help me with this. I have data like this:

ID,colour
1,base_yellow
1,blue
1,base_red
1,blue
1,pink
1,blue
1,base_yellow
2,base_yellow
2,blue
2,base_red
2,blue
2,pink
2,blue
2,base_yellow
3,base_yellow
3,blue
3,pink
3,blue
3,base_yellow
4,base_yellow
4,blue
4,green
4,blue
4,green
4,blue
4,pink
4,blue
4,base_yellow

Every time a base colour (base_yellow, base_red) is encountered, a new group should start. The output should look like this, stored in a new variable:

ID,colour
1,base_yellow; blue; base_red
1,base_red; blue; pink; blue; base_yellow
2,base_yellow; blue; base_red
2,base_red; blue; pink; blue; base_yellow
3,base_yellow; blue; pink; blue; base_yellow
4,base_yellow; blue; green; blue; green; blue; pink; blue; base_yellow

Try this:

library(tidyverse)

# Read data
mydata <- tibble::tribble(~ID,~colour,
                          1,"base_yellow",
                          1,"blue",
                          1,"base_red",
                          1,"blue",
                          1,"pink",
                          1,"blue",
                          1,"base_yellow",
                          2,"base_yellow",
                          2,"blue",
                          2,"base_red",
                          2,"blue",
                          2,"pink",
                          2,"blue",
                          2,"base_yellow",
                          3,"base_yellow",
                          3,"blue",
                          3,"pink",
                          3,"blue",
                          3,"base_yellow",
                          4,"base_yellow",
                          4,"blue",
                          4,"green",
                          4,"blue",
                          4,"green",
                          4,"blue",
                          4,"pink",
                          4,"blue",
                          4,"base_yellow")

# Add column to group by words starting with "base_"
mydata <- mydata %>% 
  mutate(base = str_starts(colour, "base_")) %>% 
  mutate(base = ifelse(base, colour, NA)) %>% 
  fill(base, .direction = "down")

# Group by ID and words starting with "base_" and paste words
mydata <- mydata %>% 
  group_by(ID, base) %>% 
  summarise(colour = paste(colour, collapse = ";")) %>% 
  select(-base)

Result:

> mydata
# A tibble: 6 × 2
# Groups:   ID [4]
     ID colour                                                      
  <dbl> <chr>                                                       
1     1 base_red;blue;pink;blue                                     
2     1 base_yellow;blue;base_yellow                                
3     2 base_red;blue;pink;blue                                     
4     2 base_yellow;blue;base_yellow                                
5     3 base_yellow;blue;pink;blue;base_yellow                      
6     4 base_yellow;blue;green;blue;green;blue;pink;blue;base_yellow

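You can adapt it to your needs. Note that this groups by the value of the most recent base colour, so rows that share a base within an ID are merged (for ID 1, the closing base_yellow joins the first group), and the result differs from the output requested in the question.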
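If every base occurrence should open its own group and the next base should close it, as in the question's expected output, one possible adjustment is sketched below. It is only a sketch under a few assumptions: the data frame is a trimmed-down sample rather than the full data, rows are already in order within each ID, and the helper columns seg, body, start and closing are names made up for illustration.

```r
library(dplyr)
library(stringr)

# Trimmed-down sample with the same shape as the question's data (assumption)
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 3, 3, 3),
  colour = c("base_yellow", "blue", "base_red", "pink", "base_yellow",
             "base_yellow", "blue", "base_yellow")
)

result <- df %>%
  group_by(ID) %>%
  # Running counter: increases by one at every "base_" row within an ID
  mutate(seg = cumsum(str_starts(colour, "base_"))) %>%
  group_by(ID, seg) %>%
  summarise(
    body  = str_c(colour, collapse = "; "),  # colours of this segment
    start = first(colour),                   # the base that opens it
    .groups = "drop_last"                    # stay grouped by ID
  ) %>%
  # The base that opens the next segment closes the current one
  mutate(closing = lead(start)) %>%
  ungroup() %>%
  # The last segment of each ID is only the closing base, so drop it
  filter(!is.na(closing)) %>%
  transmute(ID, colour = str_c(body, closing, sep = "; "))

result
```

Using cumsum() as a run counter keeps two separate base_yellow runs apart, which grouping by the base value itself cannot do.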

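First, create a vector vec containing the row positions where colour starts with "base".

Then use map2_dfr from purrr to pull out, for each pair of consecutive positions in vec, the rows from the start position to the end position. Because the ranges overlap at their endpoints, a base row that closes one segment also opens the next, so the same colour ends up in more than one row of the result. The grouping variable group is created in this step as well (via .id).

After grouping by group, keep only the groups with more than one distinct colour, then use str_c to collapse the colours of each group into a single string. In this data, the filter drops the pairs that straddle two IDs (the closing base_yellow of one ID followed directly by the opening base_yellow of the next).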

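library(tidyverse)

# df is the data frame from the question (columns ID and colour)
# Row positions of every colour that starts with "base"
vec <- which(grepl("^base", df$colour))

# Slice df between each pair of consecutive base positions;
# .id = "group" numbers the slices
map2_dfr(
  vec[-length(vec)],
  vec[-1],
  ~df[.x:.y, ],
  .id = "group"
) %>%
  group_by(group) %>%
  # Drop slices that hold a single distinct colour (pairs spanning two IDs)
  filter(n_distinct(colour) > 1) %>%
  summarise(ID = first(ID), colour = str_c(colour, collapse = "; ")) %>%
  select(-group)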

Output

     ID colour                                                              
  <int> <chr>                                                               
1     1 base_yellow; blue; base_red                                         
2     1 base_red; blue; pink; blue; base_yellow                             
3     2 base_yellow; blue; base_red                                         
4     2 base_red; blue; pink; blue; base_yellow                             
5     3 base_yellow; blue; pink; blue; base_yellow                          
6     4 base_yellow; blue; green; blue; green; blue; pink; blue; base_yellow
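To see in isolation how the start and end positions pair up and why single-colour slices are filtered out, here is a small base R sketch. The six-row sample and the names starts, ends, segments, keep and collapsed are assumptions made for illustration, not part of the answer above.

```r
# Tiny sample: two IDs, each opening and closing with a base colour (assumption)
ID     <- c(1, 1, 1, 2, 2, 2)
colour <- c("base_yellow", "blue", "base_yellow",
            "base_yellow", "pink", "base_red")

# Positions of the base rows
vec <- which(grepl("^base", colour))

# Consecutive pairs: each base position is the end of one slice
# and the start of the next
starts <- vec[-length(vec)]
ends   <- vec[-1]

# Colours between each start and end position, endpoints included
segments <- Map(function(s, e) colour[s:e], starts, ends)

# A slice with one distinct colour is the jump from the last row of
# one ID to the first row of the next, so it is discarded
keep <- vapply(segments, function(x) length(unique(x)) > 1, logical(1))

collapsed <- vapply(segments[keep], paste, character(1), collapse = "; ")
result <- data.frame(ID = ID[starts[keep]], colour = collapsed)
result
```

One caveat of this filter: a slice made of the same base colour twice in a row inside a single ID would be dropped as well, so it may be safer to compare the ID at the start and end positions if that can occur in your data.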