Split string based on condition in R
I am working with a table that looks like this:
library(tidyverse)
id <- c(1, 1, 2, 2)
year <- rep(1990:1991, 2)
occ <- c("former farmer carpenter", "cleaner janitor", "carpenter", "carpenter former cleaner")
old_occ <- c("former farmer", "cleaner", "", "")
df <- tibble(id, year, occ, old_occ)
I would like to split the strings so that each title gets its own cell, like this:
id  occ1           occ2
1   former farmer  carpenter
1   cleaner        janitor
2   carpenter
2   carpenter      former cleaner
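Now, this would be quite simple if every cell contained either one occupation (e.g. "carpenter") or two occupations (e.g. "cleaner janitor"). However, as you can see, some occupational titles carry information about a previous occupation, e.g. "former cleaner". These titles consist of two words and can appear either before or after the currently held occupation in the cell.
Does anyone have a suggestion for how I can split the strings to get the result I want?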
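As suggested by @GKi in the comments, you can use gsub with a regular expression to join the "former *" titles with "_". Then strsplit the strings, keep the unique values, adjust the length, and cbind.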
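# Join "former <title>" with "_"; the lookbehind needs perl=TRUE and "\\s" must be escaped in an R string
df[] <- lapply(df, function(x) gsub("(?<=former)\\s", "_", x, perl=TRUE))
# Paste occ and old_occ together, split on spaces, and drop repeated titles
tmp <- lapply(strsplit(Reduce(paste, df[c("occ", "old_occ")]), " "), unique)
# Pad every element with NA up to the longest length, then bind to id and year
mxlen <- max(lengths(tmp))
res <- cbind(df[-(3:4)],
             `colnames<-`(t(sapply(tmp, `length<-`, mxlen)),
                          paste0("title", sprintf(paste0(".%0", mxlen, "d"), seq(mxlen)))))
res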
# id year title.01 title.02
# 1 1 1990 former_farmer carpenter
# 2 1 1991 cleaner janitor
# 3 2 1990 carpenter <NA>
# 4 2 1991 carpenter former_cleaner
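To make the individual steps of the answer above easier to follow, here is a minimal sketch that runs them on a plain character vector instead of the whole data frame. The input is copied from the question's occ column; the names joined, titles, n and padded are only illustrative and do not appear in the original answer.

```r
# Illustrative input, copied from the question's `occ` column
occ <- c("former farmer carpenter", "cleaner janitor",
         "carpenter", "carpenter former cleaner")

# Step 1: replace the whitespace that directly follows "former" with "_".
# The lookbehind (?<=former) requires perl = TRUE, and the backslash
# has to be doubled inside an R string literal.
joined <- gsub("(?<=former)\\s", "_", occ, perl = TRUE)

# Step 2: a plain split on spaces now keeps "former_farmer" in one piece
titles <- strsplit(joined, " ", fixed = TRUE)

# Step 3: pad every element with NA up to the longest length and
# transpose, so that each input string becomes one row
n <- max(lengths(titles))
padded <- t(sapply(titles, `length<-`, n))
padded
```

Note that sapply only returns a matrix here because the longest element has more than one title; if every string held a single title, the result would need to be reshaped differently.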
You can create a new row for each word, combine former with the word that follows it, and reshape the data to wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  separate_rows(occ) %>%
  group_by(id, row) %>%
  group_by(grp = lag(cumsum(occ != 'former'), default = 0) + 1, .add = TRUE) %>%
  summarise(occ = paste0(occ, collapse = ' ')) %>%
  pivot_wider(names_from = grp, values_from = occ, names_prefix = 'occ') %>%
  ungroup() %>% select(-row)
# id occ1 occ2
# <dbl> <chr> <chr>
#1 1 former farmer carpenter
#2 1 cleaner janitor
#3 2 carpenter NA
#4 2 carpenter former cleaner
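Neither answer uses it, but the same idea can also be sketched in base R by extracting the titles directly with a regular expression rather than splitting and regrouping. This is only a sketch under the assumption that a previous occupation is always the word "former" followed by exactly one word; the pattern and the names m, n and wide are illustrative.

```r
# Illustrative input, copied from the question's `occ` column
occ <- c("former farmer carpenter", "cleaner janitor",
         "carpenter", "carpenter former cleaner")

# Match "former <word>" first; otherwise fall back to a single word.
# Assumption: a former title is always "former" plus exactly one word.
m <- regmatches(occ, gregexpr("former \\w+|\\w+", occ, perl = TRUE))

# Pad every element with NA up to the longest length and stack the
# elements as rows; rbind keeps a matrix even if n happens to be 1
n <- max(lengths(m))
wide <- as.data.frame(do.call(rbind, lapply(m, `length<-`, n)),
                      stringsAsFactors = FALSE)
names(wide) <- paste0("occ", seq_len(n))
wide
```

Unlike the gsub approach, this keeps the space in "former farmer", which matches the layout asked for in the question.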