How to stem all words in an ngram, using quanteda?
I'm currently working with the quanteda package in R, and I'd like to compute ngrams of a set of stemmed words to get a quick estimate of which content words tend to appear near one another. If I try:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
twitter.semantic <- twitter.docs %>%
    dfm(removeTwitter = TRUE, ignoredFeatures = stopwords("english"),
        ngrams = 2, skip = 0:3, stem = TRUE) %>%
    trim(minCount = 50, minDoc = 2)
it stems only the last word of each bigram. But if I try stemming first:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
stemmed_no_stops <- twitter.docs %>%
    toLower %>%
    tokenize(removePunct = TRUE, removeTwitter = TRUE) %>%
    removeFeatures(stopwords("english")) %>%
    wordstem
twitter.semantic <- stemmed_no_stops %>%
    skipgrams(n = 2, skip = 0:2) %>%
    dfm %>%
    trim(minCount = 25, minDoc = 2)
then quanteda doesn't know how to use the stemmed list; I get the error:
assignment of an object of class “NULL” is not valid for @‘ngrams’
in an object of class “dfmSparse”; is(value, "integer") is not TRUE
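To narrow down where things go wrong, a quick check (hypothetical diagnostic code, assuming the object names above) is to look at what class each step returns, since dfm() fills its @ngrams slot from a tokens-style object rather than a plain list:
# Hypothetical diagnostic: see what each step actually returns.
# If wordstem() hands back a bare list instead of a tokenized-texts
# object, dfm() finds no ngram metadata to copy into @ngrams.
class(twitter.docs)
class(stemmed_no_stops)
str(head(stemmed_no_stops, 1))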
Is there an intermediate step I can take to use dfm on the stemmed words, or a way to tell dfm to stem first and then build the ngrams?
I tried to reproduce your example with the inaugural texts. Using a reproducible example from the package data, your code works for me:
twitter.docs <- corpus(data_corpus_inaugural[1:5])
stemmed_no_stops <- twitter.docs %>%
    tokens(remove_punct = TRUE, remove_twitter = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem()
lapply(stemmed_no_stops, head)
## $`1789-Washington`
## [1] "fellow-citizen" "senat" "hous" "repres" "among"
## [6] "vicissitud"
##
## $`1793-Washington`
## [1] "fellow" "citizen" "call" "upon" "voic" "countri"
##
## $`1797-Adams`
## [1] "first" "perceiv" "earli" "time" "middl" "cours"
##
## $`1801-Jefferson`
## [1] "friend" "fellow" "citizen" "call" "upon" "undertak"
##
## $`1805-Jefferson`
## [1] "proceed" "fellow" "citizen" "qualif" "constitut" "requir"
twitter.semantic <- stemmed_no_stops %>%
    tokens_skipgrams(n = 2, skip = 0:2) %>%
    dfm() %>%
    dfm_trim(min_count = 5, min_doc = 2)
twitter.semantic[1:5, 1:4]
# Document-feature matrix of: 5 documents, 4 features.
# 5 x 4 sparse Matrix of class "dfmSparse"
# features
# docs fellow_citizen let_u unit_state foreign_nation
# 1789-Washington 2 0 2 0
# 1793-Washington 1 0 0 0
# 1797-Adams 0 0 3 5
# 1801-Jefferson 5 5 0 0
# 1805-Jefferson 8 2 1 1
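As a follow-up note: quanteda has renamed some of these arguments in releases since this answer was written. A rough sketch of the same pipeline against a newer API (the argument names below are assumptions to verify with ?dfm_trim on your installed version) would be:
library(quanteda)  # recent releases re-export %>%; otherwise library(magrittr)
# Sketch for quanteda >= 1.x: dfm_trim() now takes min_termfreq/min_docfreq
# instead of min_count/min_doc, and remove_twitter was dropped from tokens()
# in v3, so it is omitted here.
twitter.semantic <- corpus(data_corpus_inaugural[1:5]) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem() %>%
    tokens_skipgrams(n = 2, skip = 0:2) %>%
    dfm() %>%
    dfm_trim(min_termfreq = 5, min_docfreq = 2)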