Split text into ngrams without overlap in R
I have a data frame in which one column contains lengthy transcripts. I'd like to use unnest_tokens to split the transcripts into 50-word ngrams. The following code splits the transcript:
library(dplyr)
library(tidytext)

content <- data.frame(channel = c("NBC"), program = c("A"), transcript = c("This is a rather unusual glossary in that all of the words on the list are essentially synonymous - they are nouns meaning nonsense, gibberish, claptrap, hogwash, rubbish ... you get the idea. It probably shouldn't be surprising that this category is so productive of weird words. After all, what better way to disparage someone's ideas than to combine some nonsense syllables to make a descriptor for them? You more or less always can identify their meaning from context alone - either they're used as interjections, preceded by words like 'such' or 'unadulterated' or 'ridiculous'. But which to choose? You have the reduplicated ones (fiddle-faddle), the pseudo-classical (brimborion), the ones that literally mean something repulsive (spinach), and of course the wide variety that are euphemisms for bodily functions. Excluded from this list are the wide variety of very fun terms that are simple vulgarities without any specific reference to nonsense."))

content_ngram <- content %>%
  unnest_tokens(output = sentence, input = transcript, token = "ngrams", n = 50)
Because this particular transcript is 100 words long, the resulting data frame contains 51 observations: the first ngram holds words 1-50, the second holds words 2-51, and so on. Instead, I'd like to split the transcript into non-overlapping ngrams. For the example above, that means a data frame with two observations: the first containing an ngram with words 1-50, and the second containing an ngram with words 51-100.
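Schematically, the output I'm after would have this shape (placeholder strings, not real output):

#> # A tibble: 2 x 4
#>   channel program observation sentence
#>   <chr>   <chr>         <int> <chr>
#> 1 NBC     A                 1 <words 1-50 pasted into one string>
#> 2 NBC     A                 2 <words 51-100 pasted into one string>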
One option for you here would be to tokenize down to single words and then concatenate them back together into the chunks you are interested in. That is probably a better fit anyway, since n-gram tokenization overlaps by definition.
library(tidyverse)
library(tidytext)

content <- tibble(channel = c("NBC"),
                  program = c("A"),
                  transcript = c("This is a rather unusual glossary in that all of the words on the list are essentially synonymous - they are nouns meaning nonsense, gibberish, claptrap, hogwash, rubbish ... you get the idea. It probably shouldn't be surprising that this category is so productive of weird words. After all, what better way to disparage someone's ideas than to combine some nonsense syllables to make a descriptor for them? You more or less always can identify their meaning from context alone - either they're used as interjections, preceded by words like 'such' or 'unadulterated' or 'ridiculous'. But which to choose? You have the reduplicated ones (fiddle-faddle), the pseudo-classical (brimborion), the ones that literally mean something repulsive (spinach), and of course the wide variety that are euphemisms for bodily functions. Excluded from this list are the wide variety of very fun terms that are simple vulgarities without any specific reference to nonsense."))

content %>%
  # tokenize to single words (the default token)
  unnest_tokens(output = sentence,
                input = transcript) %>%
  # integer division on the row index bundles the words into chunks
  group_by(channel, program, observation = row_number() %/% 100) %>%
  # paste each chunk's words back together into one string
  summarise(sentence = str_c(sentence, collapse = " ")) %>%
  ungroup()
#> # A tibble: 2 x 4
#>   channel program observation sentence
#>   <chr>   <chr>         <dbl> <chr>
#> 1 NBC     A                 0 this is a rather unusual glossary in that al…
#> 2 NBC     A                 1 reduplicated ones fiddle faddle the pseudo c…
Created on 2019-12-13 by the reprex package (v0.3.0)
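Two caveats on the grouping trick. Since row_number() starts at 1, row_number() %/% 100 puts words 1-99 in chunk 0 and word 100 in chunk 1, and it counts across all rows at once, so with several transcripts in the data frame the chunks would run across transcript boundaries. If you want chunks of exactly 50 words as in your question, one variant is to shift the index by one and compute it per transcript; a sketch (the column name chunk is just my choice, and I'm assuming each channel/program pair identifies one transcript):

content %>%
  unnest_tokens(output = word, input = transcript) %>%
  group_by(channel, program) %>%
  # (row_number() - 1) %/% 50 maps words 1-50 to chunk 0,
  # words 51-100 to chunk 1, and so on, within each transcript
  mutate(chunk = (row_number() - 1) %/% 50) %>%
  group_by(channel, program, chunk) %>%
  summarise(sentence = str_c(word, collapse = " ")) %>%
  ungroup()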