Break corpus into chunks of N words each in R
I need to break a corpus into chunks of N words each. Say this is my corpus:
corpus <- "I need to break this corpus into chunks of ~3 words each"
One way to solve this is to turn the corpus into a data frame and tokenize it:
library(dplyr)
library(tidytext)

corpus_df <- data.frame(text = corpus)
tokens <- corpus_df %>% unnest_tokens(word, text)
Then split the data frame by rows, using the code below (taken from here):
chunk <- 3                                         # words per chunk
n <- nrow(tokens)                                  # total number of tokens
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # chunk id for each row
d <- split(tokens, r)                              # list of one data frame per chunk
This works, but there has to be a more direct way. Any takers?
To split a string into chunks of N words each, you can use tokenizers::chunk_text():
corpus <- "I need to break this corpus into chunks of ~3 words each"
library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)
corpus %>%
chunk_text(3)
[[1]]
[1] "i need to"
[[2]]
[1] "break this corpus"
[[3]]
[1] "into chunks of"
[[4]]
[1] "3 words each"
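As a side note, the word-count argument of chunk_text() is named chunk_size (assuming the current tokenizers API), so the call above can also be written with the argument named explicitly:

corpus %>%
  chunk_text(chunk_size = 3)  # same result as chunk_text(3)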
To return a data frame, you can do:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text)
# A tibble: 12 x 2
group word
<int> <chr>
1 1 i
2 1 need
3 1 to
4 2 break
5 2 this
6 2 corpus
7 3 into
8 3 chunks
9 3 of
10 4 3
11 4 words
12 4 each
If you want these as a list of data frames of 3 words each:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text) %>%
group_split(group)
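If you need to do this repeatedly, the pipeline above can be wrapped in a small helper. chunk_corpus() below is only a hypothetical name used for illustration; it composes exactly the calls already shown:

library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)

# Hypothetical helper (not from any package): split a corpus into a
# list of data frames of n words each.
chunk_corpus <- function(corpus, n) {
  corpus %>%
    chunk_text(n) %>%
    enframe(name = "group", value = "text") %>%
    unnest_tokens(word, text) %>%
    group_split(group)
}

chunk_corpus("I need to break this corpus into chunks of ~3 words each", 3)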