How do I "split" a text document or string of text in R so that each word is its own row in a data frame?
documents <- c("This is document number one", "document two is the second element of the vector")
The data frame I want to create is:
idealdf <- c("this", "is", "document", "number", "one", "document", "two", "is", "the", "second", "element", "of", "the", "vector")
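I have been using the tm package to convert my documents to a corpus, remove punctuation, convert to lowercase, and so on, like this:
library(tm)
#create a corpus:
myCorpus <- Corpus(VectorSource(documents))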
#convert to lowercase:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove punctuation:
myCorpus <- tm_map(myCorpus, removePunctuation)
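...but I'm having trouble getting this into a data frame in which each word has its own row (I'd prefer one row per word, even if the same word then appears in multiple rows).
Thanks.
How about: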
library(stringi)
data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(documents))))
#       words
# 1      this
# 2        is
# 3  document
# 4    number
# 5       one
# 6  document
# 7       two
# 8        is
# 9       the
# 10   second
# 11  element
# 12       of
# 13      the
# 14   vector
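If you would rather not add a package, here is a minimal base R sketch of the same idea. It assumes that words are separated by whitespace and that all punctuation can simply be dropped. The column names `doc` and `word` are my own choice, not something from the question; `doc` just records which element of `documents` each word came from.

```r
documents <- c("This is document number one",
               "document two is the second element of the vector")

# Lowercase, strip punctuation, then split on runs of whitespace.
cleaned <- gsub("[[:punct:]]", "", tolower(documents))
tokens  <- strsplit(trimws(cleaned), "\\s+")

# One row per word; `doc` (a hypothetical column name) is the index of the
# source document, repeated once for every word that document contains.
words_df <- data.frame(
  doc  = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
```

Keeping the document index is optional, but it makes it easy to regroup the words by document later.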
Well, to stack all the words into a single vector, I would use stringr::str_match_all, like this:
> library(stringr)
> documents <- c("This is document number one", "document two is the second element of the vector")
> str_match_all(documents, '\\w+\\b')
[[1]]
     [,1]
[1,] "This"
[2,] "is"
[3,] "document"
[4,] "number"
[5,] "one"

[[2]]
     [,1]
[1,] "document"
[2,] "two"
[3,] "is"
[4,] "the"
[5,] "second"
[6,] "element"
[7,] "of"
[8,] "the"
[9,] "vector"
> unlist(str_match_all(documents, '\\w+\\b'))
[1] "This" "is" "document" "number" "one" "document" "two" "is" "the" "second" "element" "of" "the" "vector"
> length(unlist(str_match_all(documents, '\\w+\\b')))
[1] 14
> do.call(rbind, str_match_all(documents, '\\w+\\b'))
      [,1]
 [1,] "This"
 [2,] "is"
 [3,] "document"
 [4,] "number"
 [5,] "one"
 [6,] "document"
 [7,] "two"
 [8,] "is"
 [9,] "the"
[10,] "second"
[11,] "element"
[12,] "of"
[13,] "the"
[14,] "vector"
I think this solves your problem, but depending on the number of words, I'm not sure how efficient it is.
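If the intermediate matrices are a concern, one possible variation is to extract the matches as plain character vectors and flatten them in one step. The sketch below swaps stringr for base R's `gregexpr` and `regmatches`, lowercases first so the result matches the `idealdf` asked for, and wraps the result in a data frame. I have not benchmarked it, so treat it as an assumption that it scales better rather than a measured result.

```r
documents <- c("This is document number one",
               "document two is the second element of the vector")

# Find every run of word characters in the lowercased text.
lowered <- tolower(documents)
matches <- regmatches(lowered, gregexpr("\\w+", lowered))

# `matches` is a list with one character vector per document;
# unlist() flattens it without building any per-document matrices.
idealdf <- data.frame(words = unlist(matches), stringsAsFactors = FALSE)
```

Note that `\\w` also matches digits and underscores, so a token such as `item_2` stays in one piece.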