How can I tokenize a text column in R? unnest function not working
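I am new to R and would appreciate help with a tokenization problem.
Summary of my task:
I am trying to import a text file into R. One of the text columns is Headline. The dataset is basically a collection of news articles related to a disease.
Problem:
I have tried several times to tokenize it with the unnest_tokens function.
It shows me the following error messages:
Error in UseMethod("unnest_tokens_") :
no applicable method for 'unnest_tokens_' applied to an object of class "character"
Error in unnest_tokens(word, Headline) : object 'word' not found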
library(dplyr)
library(tidytext)
DengueNews %>%
unnest_tokens(word, Headline)
Note:
Link to the dataset: https://drive.google.com/file/d/18VWg-2sO11GpwxMGF1UbziodoWK9B9Ru/view?usp=sharing
I am following the instructions at https://www.tidytextmining.com/tidytext.html
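It is not clear how the data was read. As mentioned in the comments, if the column 'Headline' has class character, this should work. Here, we use read_excel from the readxl package to read the dataset. By default, text columns are returned with class character.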
library(readxl)
library(tidytext)
DengueNews <- read_excel("DengueNews.xlsx")
class(DengueNews$Headline)
#[1] "character"
DengueNews %>%
unnest_tokens(word, Headline)
# A tibble: 217 x 4
Serial Date Newscontent word
<dbl> <chr> <chr> <chr>
1 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dghs
2 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 491
3 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… more
4 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… hospitali…
5 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… for
6 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dengue
7 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… in
8 216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 24hrs
9 215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… 1
10 215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… more
# … with 207 more rows
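The first error in the question says the object handed to unnest_tokens has class "character". One possible cause is that the data ended up as a bare character vector rather than a data frame. The sketch below shows how that case could be handled; the headlines and the names headlines, headline_df and headline_tokens are hypothetical and only for illustration.

library(dplyr)
library(tidytext)

# Hypothetical: headlines held as a plain character vector,
# for example after reading a text file line by line
headlines <- c("Dengue cases rise", "Hospitals admit more patients")

# unnest_tokens expects a data frame, so wrap the vector in a tibble first
headline_df <- tibble(line = seq_along(headlines), Headline = headlines)

# One row per word; the Headline column is replaced by the new word column
headline_tokens <- headline_df %>%
  unnest_tokens(word, Headline)

print(headline_tokens)

Running class(DengueNews) before tokenizing is a quick way to see whether this applies to the real data.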
If we change the column to another class, such as factor, it will fail:
library(dplyr)
DengueNews %>%
mutate(Headline = factor(Headline)) %>%
unnest_tokens(word, Headline)
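If the column does turn out to be a factor, converting it back to character before tokenizing is one way around the failure. This is a minimal sketch with a small made-up data frame called news standing in for DengueNews; the headlines are invented for illustration.

library(dplyr)
library(tidytext)

# Hypothetical stand-in for DengueNews with Headline stored as a factor
news <- tibble(
  Serial = c(1, 2),
  Headline = factor(c("Dengue cases rise", "Hospitals admit more patients"))
)

tokens <- news %>%
  mutate(Headline = as.character(Headline)) %>%  # factor -> character
  unnest_tokens(word, Headline)                  # one lowercased word per row

print(tokens)

The same mutate step can be placed in front of unnest_tokens in the original pipeline on DengueNews.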