在 R 中输出包含一元字母和二元字母的文本
Output text with both unigrams and bigrams in R
我正在尝试找出如何在 R 中识别文本中的一元字母和二元字母,然后根据阈值将两者保留在最终输出中。我已经在 Python 中使用 gensim 的 Phraser 模型完成了此操作,但还没有弄清楚如何在 R 中执行此操作。
例如:
strings <- data.frame(text = 'This is a great movie from yesterday', 'I went to the movies', 'Great movie time at the theater', 'I went to the theater yesterday')
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
谢谢!
您可以为此使用 quanteda 框架:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
features
docs great movie yesterday great_movie went theater
text1 1 1 1 1 0 0
text2 0 0 0 0 1 0
text3 1 1 0 1 0 1
text4 0 0 1 0 1 1
# use convert function to turn this into a data.frame
或者您可以使用 tidytext 包、tm、tokenizers 等等。这完全取决于您期望的输出。
使用 tidytext / dplyr 的示例如下所示:
library(tidytext)
library(dplyr)
strings %>%
unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>%
count(bigs) %>%
filter(n >= 2)
bigs n
1 great 2
2 great_movie 2
3 movie 2
4 theater 2
5 went 2
6 yesterday 2
quanteda 和 tidytext 都有很多可用的联机帮助。查看 cran 上两个包的小插图。
我正在尝试找出如何在 R 中识别文本中的一元字母和二元字母,然后根据阈值将两者保留在最终输出中。我已经在 Python 中使用 gensim 的 Phraser 模型完成了此操作,但还没有弄清楚如何在 R 中执行此操作。
例如:
strings <- data.frame(text = 'This is a great movie from yesterday', 'I went to the movies', 'Great movie time at the theater', 'I went to the theater yesterday')
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
谢谢!
您可以为此使用 quanteda 框架:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
features
docs great movie yesterday great_movie went theater
text1 1 1 1 1 0 0
text2 0 0 0 0 1 0
text3 1 1 0 1 0 1
text4 0 0 1 0 1 1
# use convert function to turn this into a data.frame
或者您可以使用 tidytext 包、tm、tokenizers 等等。这完全取决于您期望的输出。
使用 tidytext / dplyr 的示例如下所示:
library(tidytext)
library(dplyr)
strings %>%
unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>%
count(bigs) %>%
filter(n >= 2)
bigs n
1 great 2
2 great_movie 2
3 movie 2
4 theater 2
5 went 2
6 yesterday 2
quanteda 和 tidytext 都有很多可用的联机帮助。查看 cran 上两个包的小插图。