Keeping document number in tidytext
When I call unnest_tokens on a list I typed in by hand, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
# test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
# break text into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
The result looks like this, which is what I want.
   row_name word
      <int> <chr>
 1        1 furlough
 2        2 work
 3        2 more
 4        2 for
 5        2 less
 6        2 pai
 7        3 total
 8        3 burnout
 9        3 and
10        3 exhaust
But when I try to read the real responses from a csv file:
# import data
text <- read.csv("TextSample.csv", stringsAsFactors = FALSE)
but otherwise use the same code:
# break text file into single words and list which row they are in
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
I get the whole list of tokens assigned to row 1, then again to row 2, and so on.
   row_name word
      <int> <chr>
 1        1 c
 2        1 furlough
 3        1 work
 4        1 more
 5        1 for
 6        1 less
 7        1 pai
 8        1 total
 9        1 burnout
10        1 and
Or, if I move mutate(row_name = row_number()) to after the unnest command, I get a row number for each token.
   word     row_name
   <chr>       <int>
 1 c               1
 2 furlough        2
 3 work            3
 4 more            4
 5 for             5
 6 less            6
 7 pai             7
 8 total           8
 9 burnout         9
10 and            10
What am I missing?
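I think that when you import the text with text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), text is a data frame, whereas when you type it in by hand it is a vector.
If you change the code to: text_df <- tibble(text = text$col_name)
so that, in the csv case, you select the column of the data frame (which is a vector), I think you should get the same result as before.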
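To make the difference concrete, here is a minimal sketch of the csv case. It builds a small data frame as a stand-in for the result of read.csv, so it runs without the file; the column name `response` is hypothetical, so substitute whatever names(text) reports for your data. A likely reason for the stray "c" token is that as.character() on a whole data frame column set deparses it into a single string such as c("furloughs", ...), which is then recycled onto every row, but treat that as my reading rather than something I have confirmed on your file.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Stand-in for read.csv("TextSample.csv", stringsAsFactors = FALSE).
# The column name `response` is an assumption made for this example.
text <- data.frame(
  response = c("furloughs",
               "Working MORE for less pay",
               "total burnout and exhaustion"),
  stringsAsFactors = FALSE
)

# read.csv returns a data frame; the text itself lives in one of its columns.
is.data.frame(text)
is.character(text$response)

# Pull out the column (a character vector) before building the tibble,
# and number the documents before tokenizing so each word keeps its row.
tidy_text <- tibble(text = text$response) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

tidy_text
```

With the column extracted this way, the mutate_all(as.character) step is no longer needed, because the column is already character when stringsAsFactors = FALSE.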