Split string based on condition in R

I am working with a table that looks like this:

library(tidyverse)

id <- c(1, 1, 2, 2)
year <- rep(1990:1991, 2)
occ <- c("former farmer carpenter", "cleaner janitor", "carpenter", "carpenter former cleaner")
old_occ <- c("former farmer", "cleaner", "", "")

df <- tibble(id, year, occ, old_occ)

I want to split the strings so that every title gets its own cell, like this:

id occ1            occ2
1  former farmer   carpenter
1  cleaner         janitor
2  carpenter
2  carpenter       former cleaner

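Now, this would be straightforward if every cell contained either one occupation (such as "carpenter") or two occupations (such as "cleaner janitor"). However, as you can see, some occupational titles carry information about a previous occupation, for example "former cleaner". These titles consist of two words and can appear either before or after the currently held occupation in the cell.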

Does anyone have a suggestion for how I could split the strings to get the result I want?

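As @GKi suggested in the comments, you can use gsub with a regular expression to join each "former *" title with an underscore ("_"). Then split with strsplit, keep the unique values, pad each result to the same length, and cbind everything together.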

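# Replace the whitespace directly after "former" with "_" (note the doubled backslash in an R string)
df[] <- lapply(df, function(x) gsub("(?<=former)\\s", "_", x, perl=TRUE))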
tmp <- lapply(strsplit(Reduce(paste, df[c("occ", "old_occ")]), " "), unique)
mxlen <- max(lengths(tmp))
res <- cbind(df[-(3:4)], 
             `colnames<-`(t(sapply(tmp, `length<-`, mxlen)), 
                          paste0("title", sprintf(paste0(".%0", mxlen, "d"), seq(mxlen)))))
res
#   id year      title.01       title.02
# 1  1 1990 former_farmer      carpenter
# 2  1 1991       cleaner        janitor
# 3  2 1990     carpenter           <NA>
# 4  2 1991     carpenter former_cleaner
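
The two less obvious steps above, the lookbehind substitution and the padding with `length<-`, may be easier to follow in isolation. The following is only a minimal sketch on a made-up character vector; the names x, joined, parts, padded, and strict are illustrative and do not come from the answer. It also shows one possible caveat: the pattern (?<=former)\\s matches after any word that ends in "former" (such as "performer"), and adding a word boundary is one way to avoid that, assuming such titles can occur in your data.

```r
# Hypothetical input, not the question's data frame
x <- c("former farmer carpenter", "carpenter", "carpenter former cleaner")

# Replace only the whitespace that directly follows "former"
joined <- gsub("(?<=former)\\s", "_", x, perl = TRUE)
joined

# Split on the remaining spaces, then pad every piece to the same length with NA
parts <- strsplit(joined, " ")
padded <- t(sapply(parts, `length<-`, max(lengths(parts))))
padded

# Assumption: titles such as "performer" may occur; \\b keeps them untouched
strict <- gsub("(?<=\\bformer)\\s", "_", "performer singer", perl = TRUE)
strict
```

If the underscores are not wanted in the final table, they can be turned back into spaces afterwards, for example with gsub("_", " ", ...) on the title columns.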

You can create a new row for each word, combine "former" with the word that follows it, and then reshape the data to wide format.

library(dplyr)
library(tidyr)

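df %>%
  mutate(row = row_number()) %>%   # remember which original row each word came from
  separate_rows(occ) %>%           # one row per word
  group_by(id, row) %>%
  # compute grp in a grouped mutate() so the counter restarts for every original row;
  # a word following 'former' stays in the same group as 'former'
  mutate(grp = lag(cumsum(occ != 'former'), default = 0) + 1) %>%
  group_by(grp, .add = TRUE) %>%
  summarise(occ = paste0(occ, collapse = ' ')) %>%
  pivot_wider(names_from = grp, values_from = occ, names_prefix = 'occ') %>%
  ungroup() %>%
  select(-row)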

#     id occ1          occ2          
#  <dbl> <chr>         <chr>         
#1     1 former farmer carpenter     
#2     1 cleaner       janitor       
#3     2 carpenter     NA            
#4     2 carpenter     former cleaner
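
To make the grouping counter less opaque, here is a small base R sketch of the same idea on the words of a single cell. It is an illustration under assumptions, not the answer's code: the names words, counter, grp, and merged are made up, and c(0, head(counter, -1)) stands in for dplyr's lag(counter, default = 0).

```r
# Words of one cell, as separate_rows() would produce them
words <- c("carpenter", "former", "cleaner")

# Count the words that are not 'former'
counter <- cumsum(words != "former")

# Shift the counter by one position so 'former' shares a group with the next word
grp <- c(0, head(counter, -1)) + 1

# Paste the words of each group back together
merged <- tapply(words, grp, paste, collapse = " ")
merged
```

Because the counter only advances after a word other than "former", the word "former" and the title that follows it always receive the same group number, which is what lets the later pivot place them in one cell.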