Join tokens back to sentence
I am using tidytext to run some text analysis on free-text data. Consider these example sentences:
"The quick brown fox jumps over the lazy dog"
"I love books"
I tokenize them with tidytext:
unigrams = tweet_text %>%
  unnest_tokens(output = word, input = txt) %>%
  anti_join(stop_words)
The result looks like this:
The
quick
brown
fox
jumps
over
the
lazy
dog
I now need to join each unigram back to its original sentence:
"The quick brown fox jumps over the lazy dog" | The
"The quick brown fox jumps over the lazy dog" | quick
"The quick brown fox jumps over the lazy dog" | brown
"The quick brown fox jumps over the lazy dog" | fox
"The quick brown fox jumps over the lazy dog" | jumps
"The quick brown fox jumps over the lazy dog" | over
"The quick brown fox jumps over the lazy dog" | the
"The quick brown fox jumps over the lazy dog" | lazy
"The quick brown fox jumps over the lazy dog" | dog
"I love books" | I
"I love books" | love
"I love books" | books
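I'm a bit stuck. The solution needs to scale to thousands of sentences. I thought something like this might be built into tidytext, but I haven't found anything yet.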
What you are looking for is the drop = FALSE argument:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
tweet_text <- tibble(id = 1:2,
                     text = c("The quick brown fox jumps over the lazy dog",
                              "I love books"))
tweet_text %>%
  unnest_tokens(output = word, input = text, drop = FALSE)
#> # A tibble: 12 x 3
#>       id text                                        word
#>    <int> <chr>                                       <chr>
#>  1     1 The quick brown fox jumps over the lazy dog the
#>  2     1 The quick brown fox jumps over the lazy dog quick
#>  3     1 The quick brown fox jumps over the lazy dog brown
#>  4     1 The quick brown fox jumps over the lazy dog fox
#>  5     1 The quick brown fox jumps over the lazy dog jumps
#>  6     1 The quick brown fox jumps over the lazy dog over
#>  7     1 The quick brown fox jumps over the lazy dog the
#>  8     1 The quick brown fox jumps over the lazy dog lazy
#>  9     1 The quick brown fox jumps over the lazy dog dog
#> 10     2 I love books                                i
#> 11     2 I love books                                love
#> 12     2 I love books                                books
Created on 2020-02-22 by the reprex package (v0.3.0)
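The question's pipeline also removes stop words, so here is a minimal sketch of how that step might be combined with drop = FALSE. It assumes the tweet_text tibble from above, with the column named text rather than the question's txt, and joins on the word column of tidytext's stop_words dataset. The explicit by = "word" is my addition to silence the join message; it is not in the original code.

```r
library(dplyr)
library(tidytext)

# Same example data as above (column name `text` is an assumption;
# the question's own data uses `txt`).
tweet_text <- tibble(id = 1:2,
                     text = c("The quick brown fox jumps over the lazy dog",
                              "I love books"))

unigrams <- tweet_text %>%
  # drop = FALSE keeps the original `text` column next to each token
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  # remove rows whose token appears in tidytext's stop word list
  anti_join(stop_words, by = "word")

unigrams
```

Each remaining row still carries its source sentence in text, so no separate join back to the original data should be needed. Tokens such as "the" are dropped by the anti_join, and unnest_tokens lowercases tokens by default, which is why the output above shows "the" and "i" rather than "The" and "I".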