Messy string column to wide format
I have the following dataset:
Input
date        string                                            value
2021-01-01  a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020   20
2021-02-01  a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020   30
2021-01-15  a=ca_b=goo2_c=brandC_e=102331                     40
2022-01-15  2                                                 0
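I need to create a wide data frame from the values in string (see the output below). I have hundreds of names; this is just a reproducible example.
Expected output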
date        a   b     c       d                         e       value  2
2021-01-01  uk  goo1  brandA  phrase_d1 for pedro 2020  NA      20     NA
2021-02-01  us  goo2  brandB  phrase_d2 for peter 2020  NA      30     NA
2021-01-15  ca  goo2  brandC  NA                        102331  40     NA
2022-01-15  NA  NA    NA      NA                        NA      0      NA
What would be a neat solution? I am thinking that a combination of reshape and sub might handle it.
Data
data = data.frame(date = c("2021-01-01", "2021-02-01", "2021-01-15", "2022-01-15"),
                  string = c("a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
                             "a=us_b=goo2_c=brandB_d=phrase_d2 for peter 2020",
                             "a=ca_b=goo2_c=brandC_e=102331", 2),
                  value = c(20, 30, 40, 0))
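@PaulS's solution is more concise than mine, but it requires that the only underscores inside the values you want to keep are followed by a d and then a digit. If an underscore inside a value is followed by any other pattern, the solution breaks. Here is a simple example: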
dat <- tibble::tribble(
~date, ~string, ~value,
"2021-01-01", "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020", 20,
"2021-02-01", "abc=us_def=goo2_ghi=brandB_jkl=phrase_d2 for peter 2020", 30,
"2021-01-015", "abc=ca_def=goo2_ghi=brandC_mno=102331", 40)
library(stringr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
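dat %>%
  # Split on "_" unless it is followed by "d" and a digit (note the doubled backslash in an R string)
  separate_rows(string, sep="_(?!d\\d)") %>%
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>%
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)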
#> # A tibble: 3 × 8
#> date value abc def ghi jkl `dx for pedro …` mno
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2021-01-01 20 uk goo1 brandA phrase <NA> <NA>
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for pet… <NA> <NA>
#> 3 2021-01-015 40 ca goo2 brandC <NA> <NA> 1023…
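To see where the split goes wrong, it can help to look at the pieces the pattern produces before any reshaping. The following is only a minimal sketch: it uses base R's strsplit with perl = TRUE in place of separate_rows (an assumption made to keep the example free of dependencies), and the variable names are just for illustration.

```r
# Same negative lookahead as above: split on "_" unless it is followed by "d" and a digit.
pattern <- "_(?!d\\d)"

bad  <- "abc=uk_def=goo1_ghi=brandA_jkl=phrase_dx for pedro 2020"
good <- "abc=us_def=goo2_ghi=brandB_jkl=phrase_d2 for peter 2020"

pieces_bad  <- strsplit(bad,  pattern, perl = TRUE)[[1]]
pieces_good <- strsplit(good, pattern, perl = TRUE)[[1]]

print(pieces_bad)   # "_dx" is not protected, so the value of jkl is cut in two
print(pieces_good)  # "_d2" is protected, so the value of jkl stays whole
```

The stray piece "dx for pedro 2020" contains no "=", so it ends up as a column name with an NA value after the pivot.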
My solution is a bit more complicated, but I think it works in a wider range of cases:
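make_df <- function(string){
  # Split on "=": the first piece is the first key, the middle pieces are "value_nextkey"
  str <- str_split(string, "=", simplify=TRUE)
  if(length(str) == 1){
    # No "=" at all: use the string itself as the column name, with an NA value
    nm <- str[1]
    str <- list(NA)
    names(str) <- nm
  }
  if(length(str) > 1){
    # Each key is whatever follows the last underscore of the preceding piece
    nm <- c(str[1], gsub(".*_(.*?)$", "\\1", str[2:(length(str)-1)]))
    str <- str[-1]
    # Strip the trailing "_key" from each piece so only the value remains
    str <- gsub(paste0("_", nm, collapse="|"), "", str)
    str <- as.list(str)
    names(str) <- nm
  }
  do.call(data.frame, str)
}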
dat %>%
rowwise() %>%
mutate(out = make_df(string)) %>%
unnest(out) %>%
select(-string)
#> # A tibble: 4 × 8
#> date value abc def ghi jkl mno X2
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 2021-01-01 20 uk goo1 brandA phrase_dx for pedro 2020 <NA> NA
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for peter 2020 <NA> NA
#> 3 2021-01-015 40 ca goo2 brandC <NA> 102331 NA
#> 4 2021-91-15 0 <NA> <NA> <NA> <NA> <NA> NA
Created on 2022-04-08 by the reprex package (v2.0.1)
If the strings with underscores are as regular as in the example, @PaulS's solution is better. Otherwise, this one may be useful.
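A possible middle ground, offered only as a sketch: split on an underscore only when it is directly followed by something that looks like a key and an "=". This assumes that keys never contain "_" or "=" and that values never contain a "_key=" sequence; parse_pairs and the other names below are hypothetical helpers, not part of any package.

```r
# Hypothetical helper: turn one string into a named character vector of key/value pairs.
parse_pairs <- function(x) {
  # Split on "_" only when it is followed by "key=" (key assumed to contain no "_" or "=").
  parts  <- strsplit(x, "_(?=[^_=]+=)", perl = TRUE)[[1]]
  has_eq <- grepl("=", parts, fixed = TRUE)
  keys   <- sub("=.*$", "", parts)
  # A piece without "=" becomes a key with a missing value.
  vals   <- ifelse(has_eq, sub("^[^=]*=", "", parts), NA_character_)
  setNames(vals, keys)
}

strings <- c("a=uk_b=goo1_c=brandA_d=phrase_d1 for pedro 2020",
             "a=ca_b=goo2_c=brandC_e=102331",
             "2")

pairs    <- lapply(strings, parse_pairs)
all_keys <- unique(unlist(lapply(pairs, names)))

# One row per string, one column per key; keys absent from a string are filled with NA.
wide <- as.data.frame(
  do.call(rbind, lapply(pairs, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(wide) <- all_keys
print(wide)
```

The resulting columns can be bound back onto date and value with cbind. The same pattern could probably be passed as sep to separate_rows as well, but I have not checked that here.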
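Another possible solution:
library(tidyverse)
data %>%
  # Split on "_" unless it is followed by "d" and a digit (note the doubled backslash in an R string)
  separate_rows(string, sep="_(?!d\\d)") %>%
  separate(string, into=c("n1", "n2"), sep = "=", fill = "right") %>%
  pivot_wider(id_cols = c(date, value), names_from = n1, values_from = n2)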
#> # A tibble: 3 × 7
#> date value a b c d e
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 2021-01-01 20 uk goo1 brandA phrase_d1 for pedro 2020 <NA>
#> 2 2021-02-01 30 us goo2 brandB phrase_d2 for peter 2020 <NA>
#> 3 2021-01-15 40 ca goo2 brandC <NA> 102331