Keeping document number in tidytext

When I run unnest_tokens on a list I typed in by hand, the output includes the row number each word came from.

library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)


#test data
text<- c( "furloughs","Working MORE for less pay",  "total burnout and exhaustion")

#break text file into single words and list which row they are in
  text_df <- tibble(text = text)

  tidy_text <- text_df %>% 
    mutate_all(as.character) %>% 
    mutate(row_name = row_number())%>%    
    unnest_tokens(word, text) %>%
    mutate(word = wordStem(word))

The result looks like this, which is what I want.

   row_name word    
      <int> <chr>   
 1        1 furlough
 2        2 work    
 3        2 more    
 4        2 for     
 5        2 less    
 6        2 pai     
 7        3 total   
 8        3 burnout 
 9        3 and     
10        3 exhaust

But when I try to read the real responses from a csv file:

#Import data  
 text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)

and otherwise use the same code:

#break text file into single words and list which row they are in
  text_df <- tibble(text = text)

  tidy_text <- text_df %>% 
    mutate_all(as.character) %>% 
    mutate(row_name = row_number())%>%
    unnest_tokens(word, text) %>%
    mutate(word = wordStem(word))

I get the entire list of tokens assigned to row 1, then the whole list again for row 2, and so on.

   row_name word    
      <int> <chr>   
 1        1 c       
 2        1 furlough
 3        1 work    
 4        1 more    
 5        1 for     
 6        1 less    
 7        1 pai     
 8        1 total   
 9        1 burnout 
10        1 and   

Alternatively, if I move mutate(row_name = row_number()) after the unnest command, I get a separate row number for each token.

   word     row_name
   <chr>       <int>
 1 c               1
 2 furlough        2
 3 work            3
 4 more            4
 5 for             5
 6 less            6
 7 pai             7
 8 total           8
 9 burnout         9
10 and            10

What am I missing?

I think that if you import the text with text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), text is a data frame, whereas if you type it in by hand it is a vector.
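A quick way to see the difference, as a base R sketch. The data frame below only stands in for what read.csv() would return, and the column name `response` is an assumption, since the real header of TextSample.csv is not shown in the question. Converting the whole data frame with as.character() deparses the column into a single string beginning with `c(`, which is probably where the stray `c` token and the repeated full list of words come from.

```r
# Typed in by hand: a character vector, one element per response
text_vec <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")

# Stand-in for the read.csv() result; the column name `response` is hypothetical
text_csv <- data.frame(response = text_vec, stringsAsFactors = FALSE)

is.character(text_vec)    # TRUE: a plain vector
is.data.frame(text_csv)   # TRUE: a data frame, not a vector

# as.character() on a whole data frame deparses each column into one string,
# so all three responses end up inside a single value that starts with "c("
collapsed <- as.character(text_csv)
length(collapsed)
startsWith(collapsed, "c(")
```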

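If you change the code to text_df <- tibble(text = text$col_name), so that in the csv case you select the column of the data frame (which is a vector), I think you should get the same result as before.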
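A minimal sketch of that change, assuming the csv has a single column. The temporary file and the column name `response` are made up for illustration; replace them with the real file and header. Stemming with wordStem is left out to keep the example short.

```r
library(dplyr)
library(tidytext)

# Write a small stand-in csv; `response` is a hypothetical column name
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(response = c("furloughs",
                          "Working MORE for less pay",
                          "total burnout and exhaustion")),
  csv_path,
  row.names = FALSE
)

# read.csv() returns a data frame, not a character vector
text <- read.csv(csv_path, stringsAsFactors = FALSE)

# Pull out the column (a vector) before building the tibble
text_df <- tibble(text = text$response)

tidy_text <- text_df %>%
  mutate(row_name = row_number()) %>%   # number the responses before tokenizing
  unnest_tokens(word, text)             # one row per word, row_name carried along

tidy_text
```

With the row number assigned before unnest_tokens, each word keeps the number of the response it came from.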