Split words in R Dataframe column

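I have a data frame in which one column contains words separated by a single space. I want to split it into the following three types. The data frame looks like this.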

Text
one of the
i want to

I want to split it as follows.

Text         split1     split2    split3
one of the    one       one of     of the

I can get the first one, but I can't figure out the other two.

My code for getting split1:

new_data$split1<-sub(" .*","",new_data$Text)

To get split2:

df$split2 <- gsub(" [^ ]*$", "", df$Text)
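Following the same idea, split3 can presumably be obtained by removing the first word instead of the last. A minimal sketch, assuming every row has exactly three words separated by single spaces; the data frame df is recreated here from the question purely for illustration:

```r
# Example data recreated from the question (assumed to be a character column)
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# split1: drop everything from the first space onward
df$split1 <- sub(" .*", "", df$Text)
# split2: drop the last word together with the space before it
df$split2 <- sub(" [^ ]*$", "", df$Text)
# split3: drop the first word together with the space after it
df$split3 <- sub("^[^ ]* ", "", df$Text)
df
```

With more than three words per row, split2 and split3 would keep everything except the last or first word, so this only matches the desired output for three-word rows.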

There may be more elegant solutions. Here are two options:

Using ngrams:

library(dplyr); library(tm)  # ngrams() comes from NLP, which tm attaches
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]), 
              split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to

Extracting the two-grams manually:

df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, `[`, 1:2), 
              split3 = lapply(splits, `[`, 2:3)) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to
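The manual approach above does not depend on dplyr. As a rough base R sketch, a small helper (the name word_pairs is hypothetical, not taken from any package) could build every consecutive two-word pair for a string with any number of words, assuming the string has no leading whitespace:

```r
# Hypothetical helper: all consecutive two-word pairs in a single string
word_pairs <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  # Fewer than two words means there is no pair to build
  if (length(words) < 2) return(character(0))
  # Pair each word (except the last) with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

pairs <- word_pairs("one of the")
pairs
```

For a three-word string, the first and second elements of the result correspond to split2 and split3.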

Update:

With a regular expression, we can use gsub's backreferences.

split2:

gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"

split3:

gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the"  "want to"

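We can try gsub. Capture one or more non-whitespace characters (\S+) as a group (there are three words in this example), then in the replacement rearrange the backreferences and insert a delimiter (,), which read.table then uses to split the result into separate columns.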
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", 
                  "\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to

Data

df1 <- structure(list(Text = c("one of the", "i want to")), 
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))

This solution is a bit hacky.

Assumption: you don't care about the number of spaces between two words.

x <- c('one of the', 'i want to')
# Insert runs of two or more spaces as separators, then split on those runs
strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2   \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one"    "one of" "of the"

#[[2]]
#[1] "i"       "i want"  "want to"