
Join tokens back to sentence

I'm using tidytext to do some text analysis on free-text data. Consider these example sentences:

"The quick brown fox jumps over the lazy dog"
"I love books"

I tokenize them with tidytext's unnest_tokens():

unigrams <- tweet_text %>%
  unnest_tokens(output = word, input = txt)



I now need to join each unigram back to its original sentence:

"The quick brown fox jumps over the lazy dog" | The
"The quick brown fox jumps over the lazy dog" | quick
"The quick brown fox jumps over the lazy dog" | brown
"The quick brown fox jumps over the lazy dog" | fox
"The quick brown fox jumps over the lazy dog" | jumps 
"The quick brown fox jumps over the lazy dog" | over
"The quick brown fox jumps over the lazy dog" | the
"The quick brown fox jumps over the lazy dog" | lazy 
"The quick brown fox jumps over the lazy dog" | dog
"I love books" | I
"I love books" | love
"I love books" | books

I'm a bit stuck. The solution needs to scale to thousands of sentences. I assumed something like this would be built into tidytext, but I haven't found anything.

What you are looking for is the drop = FALSE argument, which keeps the original input column alongside the tokens:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidytext)

tweet_text <- tibble(id = 1:2,
                     text = c("The quick brown fox jumps over the lazy dog",
                              "I love books"))

tweet_text %>% 
  unnest_tokens(output = word, input = text, drop = FALSE)
#> # A tibble: 12 x 3
#>       id text                                        word 
#>    <int> <chr>                                       <chr>
#>  1     1 The quick brown fox jumps over the lazy dog the  
#>  2     1 The quick brown fox jumps over the lazy dog quick
#>  3     1 The quick brown fox jumps over the lazy dog brown
#>  4     1 The quick brown fox jumps over the lazy dog fox  
#>  5     1 The quick brown fox jumps over the lazy dog jumps
#>  6     1 The quick brown fox jumps over the lazy dog over 
#>  7     1 The quick brown fox jumps over the lazy dog the  
#>  8     1 The quick brown fox jumps over the lazy dog lazy 
#>  9     1 The quick brown fox jumps over the lazy dog dog  
#> 10     2 I love books                                i    
#> 11     2 I love books                                love 
#> 12     2 I love books                                books

Created on 2020-02-22 by the reprex package (v0.3.0)
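
One difference from the output requested in the question: unnest_tokens() lowercases tokens by default, so "The" and "I" come back as "the" and "i". If the original casing matters, a minimal sketch (assuming the same tweet_text tibble as above; the name tokens_cased is only illustrative) is to also pass to_lower = FALSE:

```r
library(dplyr)
library(tidytext)

# Same example data as in the answer above
tweet_text <- tibble(id = 1:2,
                     text = c("The quick brown fox jumps over the lazy dog",
                              "I love books"))

# drop = FALSE keeps the original sentence column;
# to_lower = FALSE keeps each token's original casing
tokens_cased <- tweet_text %>%
  unnest_tokens(output = word, input = text,
                drop = FALSE, to_lower = FALSE)

# First token of each sentence, casing preserved
tokens_cased$word[c(1, 10)]
#> [1] "The" "I"
```

Each row still carries its full sentence in the text column, so no separate join step is needed, whether there are two sentences or thousands.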