How to use unnest_tokens on twitter text data?

I am trying to run the following code, and it gives me an error message.

data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
      "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
      "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&amp;|&lt;|&gt;"
tidy_data <- data_df %>% 
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "data_df") %>%
filter(!word %in% stop_words$word,
     !word %in% str_remove_all(stop_words$word, "'"),
     str_detect(word, "[a-z]"))

It gives me the following error message:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : argument str should be a character vector (or an object coercible to)

How can I fix this?

The main problem is that you named your text column data but then referred to it as text. (The call unnest_tokens(word, text, token = "data_df") also passes a data frame name where a tokenizer name is expected.) Try something more like this:

library(tidyverse)
library(tidytext)

text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
          "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
          "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)

remove_reg <- "&amp;|&lt;|&gt;"

data_df %>% 
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#>    word      
#>    <chr>     
#>  1 said      
#>  2 cant      
#>  3 lil       
#>  4 dance     
#>  5 party     
#>  6 stuck     
#>  7 quarantine
#>  8 happy     
#>  9 friday    
#> 10 cousins   
#> # … with 95 more rows

If you are specifically interested in Twitter data, consider using token = "tweets":

data_df %>% 
  unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#>    word 
#>    <chr>
#>  1 who  
#>  2 said 
#>  3 we   
#>  4 cant 
#>  5 have 
#>  6 a    
#>  7 lil  
#>  8 dance
#>  9 party
#> 10 while
#> # … with 111 more rows

Created on 2020-04-12 by the reprex package (v0.3.0)

This option handles hashtags and usernames nicely.
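To see the difference, here is a quick sketch (assuming the same data_df built above) that keeps only the hashtag tokens. With token = "tweets" the tokenizer treats a hashtag such as #everydayafrica as a single token instead of stripping the # and splitting it away, so the filter below returns the hashtags intact:

```r
library(tidyverse)
library(tidytext)

# token = "tweets" preserves hashtags and @usernames as single tokens,
# so we can pull them out with a simple regex on the leading "#"
data_df %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#"))
```

With the default word tokenizer the same filter would return zero rows, because the # characters are dropped during tokenization.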