Trying to generate frequencies for NLP produces an "is not TRUE" error

Question

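I am trying to generate some frequencies and a corpus for an NLP project, and I have run into a problem with the tm package. My sample data comes from the blog feed in the archive at the following link: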

# specify the source and destination of the download
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# load the libraries
library(tm)
library(RWeka)
library(dplyr)
library(magrittr)

# load the sample data
load("sample_data.RData")

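# n-gram tokenizers; n is fixed inside each function, because a shared
# global n would be looked up at call time and make both produce trigrams
bigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 2L, max = 2L))
trigram_token <- function(x) NGramTokenizer(x, Weka_control(min = 3L, max = 3L))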

# check length function
length_is <- function(n) function(x) length(x)==n

# construct a single corpus from the sample data
vc_blogs <-
  sample_blogs %>%
  data.frame() %>%
  DataframeSource() %>%
  VCorpus %>%
  tm_map( stripWhitespace )

This produces the following error:

Error in DataframeSource(.) : 
  all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

Is there a fix or workaround that lets this code run successfully?

Answer 1

According to ?DataframeSource:

A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a UTF-8 encoded string representing the document's content. Optional additional columns are used as document level metadata.

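In the OP's example the data frame has only one column, and that column is not named accordingly, so the check for the "doc_id" and "text" column names fails with the error shown.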
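One possible way to satisfy that requirement is to build the data frame explicitly with the two expected columns before handing it to DataframeSource(). The following is a minimal sketch, not a tested fix for the OP's data: it assumes sample_blogs is a character vector with one blog entry per element (the question does not show its structure), and the two example strings, the name blogs_df, and the "blog_" id prefix are made up for illustration.

```r
library(tm)

# Hypothetical stand-in for the OP's sample_blogs; assumed to be a
# character vector with one document per element
sample_blogs <- c("First  blog   post.", "Second blog post  here.")

# DataframeSource() expects "doc_id" as the first column and "text" as the second
blogs_df <- data.frame(
  doc_id = paste0("blog_", seq_along(sample_blogs)),  # unique id per document
  text = sample_blogs,
  stringsAsFactors = FALSE
)

# Build the corpus and collapse repeated whitespace, as in the question
vc_blogs <- VCorpus(DataframeSource(blogs_df))
vc_blogs <- tm_map(vc_blogs, stripWhitespace)
```

If no document-level metadata is needed, VCorpus(VectorSource(sample_blogs)) may be a simpler alternative, since VectorSource() treats each element of a vector as one document and does not require any column names.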


r

tm