Split words in R Dataframe column
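I have a data frame in which one column contains words separated by a single space. I want to split it into the three columns shown below. The data frame looks like this.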
Text
one of the
i want to
I want to split it as follows.
Text        split1  split2  split3
one of the  one     one of  of the
I am able to get the first one, but I can't figure out the other two.
My code to get split1:
new_data$split1<-sub(" .*","",new_data$Text)
Figured out split2:
df$split2 <- gsub(" [^ ]*$", "", df$Text)
There may be more elegant solutions. Here are two options:
Using ngrams:
library(dplyr); library(tm)  # ngrams() comes from NLP, which tm attaches
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]),
split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
Extracting the bigrams manually:
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, `[`, 1:2),
split3 = lapply(splits, `[`, 2:3)) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
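Both options above return list columns, which is why the output prints as `one, of` rather than `one of`. If plain character columns are wanted, the words can be pasted back together. The following is a minimal base R sketch under that assumption; the helper name `paste_words` is hypothetical and not part of any package.

```r
# Sample data mirroring the question
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Split each row on runs of whitespace
words <- strsplit(df$Text, "\\s+")

# Hypothetical helper: join the words at positions `idx` with a single space
paste_words <- function(w, idx) paste(w[idx], collapse = " ")

# vapply guarantees one character string per row
df$split1 <- vapply(words, paste_words, character(1), idx = 1)
df$split2 <- vapply(words, paste_words, character(1), idx = 1:2)
df$split3 <- vapply(words, paste_words, character(1), idx = 2:3)
df
```

Note that a row with fewer than three words would yield `NA` inside the pasted strings, so this sketch assumes every row has at least three words.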
Update:
With regular expressions, we can use backreferences in gsub.
Split 2:
gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"
Split 3:
gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the" "want to"
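The patterns above rely on greedy `.*`, so they behave as intended only when each string has exactly three words; with a four-word string such as "a b c d", split 2 would most likely come out as "a b c". As a hedged alternative, the sketch below anchors the pattern at the start of the string and matches words with `\\S+`, which keeps the result at two words regardless of what follows. The vector `x` is made-up sample input.

```r
# Made-up sample input, including one four-word string
x <- c("one of the", "i want to", "a b c d")

# First two words: keep group 1, discard the rest of the string
split2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", x)

# Second and third words: skip the first word, then keep group 1
split3 <- sub("^\\S+\\s+(\\S+\\s+\\S+).*$", "\\1", x)

split2  # "one of" "i want" "a b"
split3  # "of the" "want to" "b c"
```

Strings with fewer than three words do not match the second pattern and are returned unchanged by `sub`, so they would need separate handling.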
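We can try gsub. Capture each run of one or more non-whitespace characters (\S+) as a group (three words in this example), then rearrange the backreferences in the replacement and insert a delimiter (,). read.table then converts the result into separate columns.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
"\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")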
df1
# Text split1 split2 split3
#1 one of the one one of of the
#2 i want to i i want want to
Data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
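To see why read.table can split the result into columns, it may help to look at the intermediate strings that gsub produces. The sketch below runs the two steps separately; the names `csv_lines` and `parts` are illustrative, and `stringsAsFactors = FALSE` is an added assumption so the columns come back as character vectors on older R versions as well.

```r
df1 <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Step 1: rewrite each row as three comma-separated fields
csv_lines <- gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", "\\1,\\1 \\2,\\2 \\3", df1$Text)
csv_lines  # "one,one of,of the" "i,i want,want to"

# Step 2: parse those strings as comma-separated text
parts <- read.table(text = csv_lines, sep = ",", stringsAsFactors = FALSE)

# Attach the parsed columns under the desired names
df1[paste0("split", 1:3)] <- parts
df1
```

This approach assumes the text itself contains no commas or quote characters, since read.table would treat them as field separators or quoting.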
This solution is a bit hacky.
Assumption: you don't care about the number of spaces between two words.
> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2  \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one" "one of" "of the"
#[[2]]
#[1] "i" "i want" "want to"