tidytext error (Error in is_corpus_df(corpus) : ncol(corpus) >= 2 is not TRUE)

Question

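I am trying to do some basic text analysis. After installing the 'tidytext' package, I tried to unnest the tokens in my data frame, but I keep getting an error. I assume I am missing a package, but I am not sure how to find out which one. Any suggestions are appreciated.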


library(dplyr)
library(tidytext)


# Import data
text <- read.csv("TextSample.csv", stringsAsFactors = FALSE)

n <- nrow(text)

text_df <- tibble(line = 1:n, text = text)

text_df %>%
  unnest_tokens(word, text)

> Error in is_corpus_df(corpus) : ncol(corpus) >= 2 is not TRUE

Output of dput(text_df):

structure(list(line = 1:6, text = structure(list(text = c("furloughs", "Students do not have their books or needed materials ", "Working MORE for less pay", "None", "Caring for an immuno-compromised spouse", "being a mom, school teacher, researcher and professor" )), class = "data.frame", row.names = c(NA, -6L))), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

Answer 1

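Your column text is actually a data frame nested inside the data frame text_df. You are therefore applying unnest_tokens() to a data frame, but it only works on an atomic vector (character, integer, double, logical, etc.). This happens because read.csv() returns a data frame, and tibble(line = 1:n, text = text) stores that whole data frame as the text column.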
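To confirm the diagnosis, you can inspect the column type directly. The following is a minimal sketch, assuming the CSV has a single column named text as the dput output suggests; responses is an illustrative stand-in for the result of read.csv(), not a name from the question.

```r
library(tibble)

# Stand-in for read.csv("TextSample.csv"): a one-column data frame (assumed shape).
responses <- data.frame(
  text = c("furloughs", "None"),
  stringsAsFactors = FALSE
)

# Same construction as in the question: the whole data frame becomes the column.
text_df <- tibble(line = seq_len(nrow(responses)), text = responses)

is.data.frame(text_df$text)  # TRUE: `text` is a nested data frame
is.character(text_df$text)   # FALSE: not a plain character vector
ncol(text_df$text)           # 1: the nested data frame has a single column
str(text_df)                 # prints the nested structure
```

The nested data frame has only one column, which is consistent with the ncol(corpus) >= 2 check in the error message failing.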

To fix this, you can do:

library(dplyr)
library(tidytext)

text_df <- text_df %>% 
  mutate_all(as.character) %>% 
  unnest_tokens(word, text)

Edit:

dplyr now has the across() function, so mutate_all() can be replaced with:

text_df <- text_df %>% 
  mutate(across(everything(), ~as.character(.))) %>% 
  unnest_tokens(word, text)

This gives you:

# A tibble: 186 x 2
   line  word     
   <chr> <chr>    
 1 1     c        
 2 1     furloughs
 3 1     students 
 4 1     do       
 5 1     not      
 6 1     have     
 7 1     their    
 8 1     books    
 9 1     or       
10 1     needed   
# ... with 176 more rows
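
The stray c token at the top of the output above suggests that as.character() deparsed the nested data frame into a single string of the form c("furloughs", ...) and repeated it on every row, rather than keeping one response per line. An alternative sketch, assuming the inner column is named text as in the dput output, pulls that column out as a plain character vector before tokenizing. The names responses and flat_df are illustrative only, and only the first three responses are reproduced here.

```r
library(dplyr)
library(tidytext)

# Stand-in for read.csv("TextSample.csv"), using the first three responses.
responses <- data.frame(
  text = c("furloughs",
           "Students do not have their books or needed materials ",
           "Working MORE for less pay"),
  stringsAsFactors = FALSE
)

# Use the character column itself (responses$text), not the whole data frame.
flat_df <- tibble(line = seq_len(nrow(responses)), text = responses$text)

# One row per word, with `line` still identifying the original response.
tokens <- flat_df %>%
  unnest_tokens(word, text)

print(tokens)
```

With this construction, line stays an integer and each word is attributed to the response it came from.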
