Messy string column to wide format

I have the following dataset:

Input

date        string                                           value
2021-01-01  a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020    20
2021-02-01  a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020    30
2021-01-15  a=ca_b=goo2_c=brandC_e=102331                      40
2022-01-15  2                                                   0

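I need to create a wide data frame from the key=value pairs in string (see the output below). I have hundreds of names; this is just a reproducible example.

Expected output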

date        a   b     c      d                          e     value 2
2021-01-01  uk  goo1  brandA phrase_d1 for pedro 2020   NA     20  NA
2021-02-01  us  goo2  brandB phrase_d2 for peter 2020   NA     30  NA
2021-01-15  ca  goo2  brandC NA                        102331  40  NA
2022-01-15  NA  NA    NA     NA                         NA      0  NA

What would be a neat solution? I am thinking a combination of reshape and sub might handle it.

Data

data = data.frame(date =c("2021-01-01","2021-02-01","2021-01-15","2022-01-15"),
                  string = c("a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
                             "a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020",
                             "a=ca_b=goo2_c=brandC_e=102331",2),
                  value = c(20,30,40,0))

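@PaulS's solution is more concise than mine, but it requires that the only underscores inside the values you want to keep are followed by a d and then a digit. If an underscore is followed by any other pattern, the solution breaks. Here is a simple example: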

dat <- tibble::tribble(
  ~date,        ~string,                                           ~value,
"2021-01-01",  "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020", 20,
"2021-02-01",  "abc=us_def=goo2_ghi=brandB_jkl=phrase_d2 for peter 2020", 30,
"2021-01-015", "abc=ca_def=goo2_ghi=brandC_mno=102331", 40)

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)


dat %>% 
  separate_rows(string, sep = "_(?!d\\d)") %>% 
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>% 
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)
#> # A tibble: 3 × 8
#>   date        value abc   def   ghi    jkl                `dx for pedro …` mno  
#>   <chr>       <dbl> <chr> <chr> <chr>  <chr>              <chr>            <chr>
#> 1 2021-01-01     20 uk    goo1  brandA phrase             <NA>             <NA> 
#> 2 2021-02-01     30 us    goo2  brandB phrase_d2 for pet… <NA>             <NA> 
#> 3 2021-01-015    40 ca    goo2  brandC <NA>               <NA>             1023…
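
To see why the first row breaks, it can help to run the same pattern on single strings. The following is a minimal base-R sketch, not part of the original reprex; the variable names are only illustrative. The negative lookahead `(?!d\\d)` refuses to split at an underscore that is followed by `d` and a digit, so `phrase_d2` stays intact while `phrase_dx` is cut in two:

```r
# Base-R illustration of the negative lookahead passed to separate_rows().
# perl = TRUE is required because lookaheads are a PCRE feature.
pattern <- "_(?!d\\d)"

ok  <- "c=brandB_d=phrase_d2 for peter 2020"
bad <- "c=brandA_d=phrase_dx for pedro 2020"

ok_parts  <- strsplit(ok,  pattern, perl = TRUE)[[1]]
bad_parts <- strsplit(bad, pattern, perl = TRUE)[[1]]

ok_parts
#> [1] "c=brandB"                   "d=phrase_d2 for peter 2020"
bad_parts
#> [1] "c=brandA"          "d=phrase"          "dx for pedro 2020"
```

The stray third piece contains no `=`, so `separate(..., fill = "right")` treats the whole piece as a name with a missing value, which is where the spurious `dx for pedro …` column comes from.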

My solution is a bit more complicated, but I think it works in a wider range of cases:

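make_df <- function(string){
  # Split on "=": the first piece is the first name, each middle piece is
  # "<value>_<next name>", and the last piece is the final value.
  str <- str_split(string, "=", simplify=TRUE)
  if(length(str) == 1){
    # No "=" at all (e.g. "2"): use the string as a column name holding NA.
    nm <- str[1]
    str <- list(NA)
    names(str) <- nm
  }
  if(length(str) > 1){
    # Names: the first piece plus whatever follows the last "_" in each middle piece.
    nm <- c(str[1], gsub(".*_(.*?)$", "\\1", str[2:(length(str)-1)]))
    str <- str[-1]
    # Strip the trailing "_<name>" from each piece so only the values remain.
    str <- gsub(paste0("_", nm, collapse="|"), "", str)
    str <- as.list(str)
    names(str) <- nm
  }
  # One-row data frame with one column per name.
  do.call(data.frame, str)
}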

dat %>% 
  rowwise() %>% 
  mutate(out = make_df(string)) %>% 
  unnest(out) %>% 
  select(-string)
#> # A tibble: 4 × 8
#>   date        value abc   def   ghi    jkl                      mno    X2   
#>   <chr>       <dbl> <chr> <chr> <chr>  <chr>                    <chr>  <lgl>
#> 1 2021-01-01     20 uk    goo1  brandA phrase_dx for pedro 2020 <NA>   NA   
#> 2 2021-02-01     30 us    goo2  brandB phrase_d2 for peter 2020 <NA>   NA   
#> 3 2021-01-015    40 ca    goo2  brandC <NA>                     102331 NA   
#> 4 2021-91-15      0 <NA>  <NA>  <NA>   <NA>                     <NA>   NA

Created on 2022-04-08 by the reprex package (v2.0.1)
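
For readers who want to follow what `make_df` does to one string, here is a step-by-step trace of the same idea in base R. It is only a sketch for illustration: it uses `strsplit()` in place of `str_split()`, and the names `parts`, `middle`, `keys`, `values` and `row` are not from the function itself.

```r
# Trace of the name/value extraction for a single string, base R only.
s <- "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020"

parts <- strsplit(s, "=", fixed = TRUE)[[1]]
# "abc" "uk_def" "goo1_ghi" "brandA_jkl" "phrase_dx for pedro 2020"

# Every middle piece ends in "_<next name>"; the first piece is the first name.
middle <- parts[2:(length(parts) - 1)]
keys <- c(parts[1], sub(".*_(.*?)$", "\\1", middle))

# Remove the trailing "_<name>" from the remaining pieces to leave the values.
values <- gsub(paste0("_", keys, collapse = "|"), "", parts[-1])

# Named list: abc = "uk", def = "goo1", ghi = "brandA",
# jkl = "phrase_dx for pedro 2020"
row <- setNames(as.list(values), keys)
```

Because the names are read from the text just before each `=`, the underscore in `phrase_dx` is never treated as a separator. The trade-off is that a value which itself contains `_` followed by one of the names would have that part removed.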

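If the strings containing underscores are as regular as in the example, @PaulS's solution is better. Otherwise, this one may be useful.

Another possible solution: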

library(tidyverse)

data %>% 
  separate_rows(string, sep = "_(?!d\\d)") %>% 
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>% 
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)

#> # A tibble: 3 × 7
#>   date       value a     b     c      d                        e     
#>   <chr>      <dbl> <chr> <chr> <chr>  <chr>                    <chr> 
#> 1 2021-01-01    20 uk    goo1  brandA phrase_d1 for pedro 2020 <NA>  
#> 2 2021-02-01    30 us    goo2  brandB phrase_d2 for peter 2020 <NA>  
#> 3 2021-01-15    40 ca    goo2  brandC <NA>                     102331
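
If the underscores inside values do not always look like `_d` plus a digit, one possible variation is to describe the separator by what follows it: an underscore counts as a separator only when the next thing is a name and an `=`. The sketch below is a hypothetical alternative pattern, not taken from either answer, and it assumes that names never contain `_` or `=`. It is shown with base-R `strsplit()` so it can be checked on its own; since `separate_rows()` already accepts a lookahead in `sep`, the same pattern should be usable there too, but that is untested here.

```r
# Hypothetical alternative pattern: split on "_" only when it is followed by
# "<name>=", where <name> is a run of characters containing no "_" or "=".
key_boundary <- "_(?=[^_=]+=)"

strings <- c(
  "a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
  "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020",
  "a=ca_b=goo2_c=brandC_e=102331"
)

# perl = TRUE is required for the lookahead.
pieces <- strsplit(strings, key_boundary, perl = TRUE)

pieces[[2]]
#> [1] "abc=uk"                       "def=goo1"
#> [3] "ghi=brandA"                   "jkl=phrase_dx for pedro 2020"
```

With this pattern `phrase_dx` and `phrase_d1` are both left intact, because neither underscore is followed by a name and an equals sign.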