break corpus into chunks of N words each in R

I need to break a corpus into chunks of N words each. Say this is my corpus:

corpus <- "I need to break this corpus into chunks of ~3 words each"

One way to approach this is to turn the corpus into a data frame and tokenize it:

library(dplyr)
library(tidytext)
corpus_df <- data.frame(text = corpus, stringsAsFactors = FALSE)
tokens <- corpus_df %>% unnest_tokens(word, text)

Then split the data frame into groups of rows using the code below (taken from here):

chunk <- 3
n <- nrow(tokens)
# group index per row: 1,1,1,2,2,2,... truncated to n entries
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
# one data frame per group of `chunk` rows
d <- split(tokens, r)
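For reference, with the 12-token corpus above this should give a list of four three-row data frames, something like (output sketched, not a verbatim console dump):

str(d, max.level = 1)

List of 4
 $ 1:'data.frame': 3 obs. of  1 variable
 $ 2:'data.frame': 3 obs. of  1 variable
 $ 3:'data.frame': 3 obs. of  1 variable
 $ 4:'data.frame': 3 obs. of  1 variable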

This works, but there must be a more direct way. Any suggestions?

To split a string into chunks of N words, you can use tokenizers::chunk_text():

corpus <- "I need to break this corpus into chunks of ~3 words each"

library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)

corpus %>%
  chunk_text(3)

[[1]]
[1] "i need to"

[[2]]
[1] "break this corpus"

[[3]]
[1] "into chunks of"

[[4]]
[1] "3 words each"

To return a data frame, you can do:

corpus %>%
  chunk_text(3) %>%
  enframe(name = "group", value = "text") %>%
  unnest_tokens(word, text)

# A tibble: 12 x 2
   group word  
   <int> <chr> 
 1     1 i     
 2     1 need  
 3     1 to    
 4     2 break 
 5     2 this  
 6     2 corpus
 7     3 into  
 8     3 chunks
 9     3 of    
10     4 3     
11     4 words 
12     4 each  

If you want these as a list of data frames, each holding its 3 separate words:

corpus %>%
  chunk_text(3) %>%
  enframe(name = "group", value = "text") %>%
  unnest_tokens(word, text) %>%
  group_split(group)
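group_split() comes from dplyr (loaded above). The result should be a list of four tibbles, the first of which looks roughly like:

[[1]]
# A tibble: 3 x 2
  group word 
  <int> <chr>
1     1 i    
2     1 need 
3     1 to   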