Join quanteda dfm top ten 1grams with all dfm 2 thru 5grams
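To save memory when dealing with a very large corpus sample, I want to take just the top 10 1grams and combine them with all of the 2 thru 5grams to form a single quanteda::dfmSparse object that will be used for natural language processing [nlp] predictions. Carrying around all the 1grams would be pointless, because only the top ten [ or twenty ] will ever be used by the simple back-off model I'm using.
I could not find a quanteda::dfm(corpusText, . . .) parameter that instructs it to return only the top ## features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/remove functions to extract the top ten 1grams, and, based on a "quanteda dfm join" search result hit, I'm using the rbind.dfmSparse??? function to join those results.
So far everything looks right from what I can tell. I thought I'd bounce this game plan off the SO community to see whether I'm overlooking a more efficient route to this result, or whether there is some flaw in the solution I've arrived at so far.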
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = F, stringsAsFactors = F)
/eoq
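For extracting the top 10 unigrams, this strategy will work just fine:
1. Sort the dfm by the (default) decreasing order of overall feature frequency, which you have already done, but then add a step to slice out the first 10 columns.
2. Combine this with the 2- to 5-gram dfm using cbind() (not rbind()).
That should do it: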
dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
head(dfmCombined, nfeat = 15)
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
# features
# docs some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
# text1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1
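To illustrate why cbind() rather than rbind() is the right join here, the following is a minimal base R sketch that uses plain count matrices as stand-ins for dfm objects (rows are documents, columns are features). The matrices, their values, and the names unigrams, bigrams, k, topUnigrams and combined are all hypothetical and chosen only for illustration; this is not quanteda code.

```r
# Toy count matrices standing in for dfm objects (hypothetical data):
# one row per document, one column per feature.
unigrams <- matrix(c(2, 1, 3, 1), nrow = 1,
                   dimnames = list("text1", c("some", "no", "very", "that")))
bigrams  <- matrix(c(2, 1), nrow = 1,
                   dimnames = list("text1", c("some_corpus", "of_no")))

# Sort features by decreasing total frequency, then slice out the first k columns.
k <- 2
ord <- order(colSums(unigrams), decreasing = TRUE)
topUnigrams <- unigrams[, ord[seq_len(k)], drop = FALSE]

# cbind() appends features (columns) for the same set of documents.
combined <- cbind(topUnigrams, bigrams)
combined
```

The result still has a single row for text1, with the sliced unigram columns followed by the bigram columns. rbind() would instead stack the two objects as rows, that is, treat them as additional documents, which is not what is wanted when both objects describe the same documents.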
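Your example code includes some use of data.table, although this does not appear in the question. In v0.99 we added a new function, textstat_frequency(), which produces a "long"/"tidy" format of frequencies in a data.frame and might be helpful: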
head(textstat_frequency(dfmCombined), 10)
# feature frequency rank docfreq
# 1 some 2 1 1
# 2 corpus 2 2 1
# 3 text 2 3 1
# 4 of 2 4 1
# 5 to 2 5 1
# 6 very 2 6 1
# 7 large 2 7 1
# 8 top 2 8 1
# 9 ten 2 9 1
# 10 some_corpus 2 10 1
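For readers who want to see what such a "long" frequency table contains, here is a rough base R sketch that builds a similar table from a plain count matrix. It is not how textstat_frequency() is implemented; the matrix counts and the name longFreq are hypothetical, and ranks are assigned simply by sort position, which is an assumption about tie handling that may differ from quanteda's.

```r
# Hypothetical document-by-feature count matrix (two documents, three features).
counts <- matrix(c(2, 0, 1,
                   1, 3, 0), nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"), c("some", "corpus", "text")))

freq <- colSums(counts)  # total occurrences of each feature

longFreq <- data.frame(
  feature   = names(freq),
  frequency = unname(freq),
  docfreq   = unname(colSums(counts > 0)),  # number of documents containing the feature
  stringsAsFactors = FALSE
)

# Order by decreasing frequency; rank by position (assumed tie handling).
longFreq <- longFreq[order(-longFreq$frequency), ]
longFreq$rank <- seq_len(nrow(longFreq))
rownames(longFreq) <- NULL
longFreq
```

Each row describes one feature, so the table can be filtered or joined like any other data.frame, which is the convenience the "long"/"tidy" format is meant to provide.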