How to "split" a text document or string of text in R so that each word is its own row in a dataframe?

Question

documents <- c("This is document number one", "document two is the second element of the vector")

The data frame I want to create is:

idealdf <- c("this", "is", "document", "number", "one", "document", "two", "is", "the", "second", "element", "of", "the", "vector")

I have been using the tm package to convert my documents into a corpus, remove punctuation, convert to lowercase, and so on, as follows:

#create a corpus:
myCorpus <- Corpus(VectorSource(documents))

#convert to lowercase:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

#remove punctuation:
myCorpus <- tm_map(myCorpus, removePunctuation)

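...but I'm having trouble getting it into a df where each word has its own row (I'd prefer each word to have its own row, even if the same word then appears on multiple rows).

Thanks.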

Answer 1

How about

library(stringi)
data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(documents))))
#       words
# 1      this
# 2        is
# 3  document
# 4    number
# 5       one
# 6  document
# 7       two
# 8        is
# 9       the
# 10   second
# 11  element
# 12       of
# 13      the
# 14   vector
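
If installing stringi is not an option, roughly the same result can be sketched in base R. This is only a minimal sketch, not part of the answer above: the names cleaned, word_list, word_df and the doc_id column are made up for illustration, and it assumes words are separated by whitespace and that stripping punctuation beforehand is acceptable.

```r
documents <- c("This is document number one",
               "document two is the second element of the vector")

# Lowercase and strip punctuation, mirroring the tm steps in the question
cleaned <- gsub("[[:punct:]]+", "", tolower(documents))

# Split each document on runs of whitespace; the result is a list of character vectors
word_list <- strsplit(trimws(cleaned), "\\s+")

# One row per word; doc_id (an illustrative extra column) records the source document
word_df <- data.frame(
  doc_id = rep(seq_along(word_list), lengths(word_list)),
  words = unlist(word_list),
  stringsAsFactors = FALSE
)
```

Repeated words stay on separate rows, as the question asks, and the doc_id column can be dropped if only the words are needed.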

Answer 2

Well, to stack all the words in a single vector, I would use stringr::str_match_all like this:

> library(stringr)
> documents <- c("This is document number one", "document two is the second element of the vector")
> str_match_all(documents, '\\w+\\b')
[[1]]
     [,1]      
[1,] "This"    
[2,] "is"      
[3,] "document"
[4,] "number"  
[5,] "one"     

[[2]]
      [,1]      
 [1,] "document"
 [2,] "two"     
 [3,] "is"      
 [4,] "the"     
 [5,] "second"  
 [6,] "element" 
 [7,] "of"      
 [8,] "the"     
 [9,] "vector"  

> unlist(str_match_all(documents, '\\w+\\b'))
 [1] "This"     "is"       "document" "number"   "one"      "document" "two"      "is"       "the"      "second"   "element"  "of"       "the"      "vector"  
> length(unlist(str_match_all(documents, '\\w+\\b')))
[1] 14
> do.call(rbind, str_match_all(documents, '\\w+\\b'))
      [,1]      
 [1,] "This"    
 [2,] "is"      
 [3,] "document"
 [4,] "number"  
 [5,] "one"     
 [6,] "document"
 [7,] "two"     
 [8,] "is"      
 [9,] "the"     
[10,] "second"  
[11,] "element" 
[12,] "of"      
[13,] "the"     
[14,] "vector"

I think this solves your problem, but depending on the number of words, I'm not sure how efficient it is.
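
To get from the stacked matrix to the lowercase, one-word-per-row data frame the question asks for, something along these lines should work. It is a sketch under the same assumptions as the session above (stringr installed, words matched by \w+); the names matches and word_df2 are illustrative only.

```r
library(stringr)

documents <- c("This is document number one",
               "document two is the second element of the vector")

# Lowercase first, then match runs of word characters in each document
matches <- str_match_all(tolower(documents), "\\w+\\b")

# Stack the per-document matrices and keep the single column of full matches
word_df2 <- data.frame(
  words = do.call(rbind, matches)[, 1],
  stringsAsFactors = FALSE
)
```

Since only the full match is needed here, unlist(matches) would give the same character vector without building the intermediate matrix.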
