How to count frequency of a multiword expression in Quanteda?

I am trying to count the frequency of a multiword expression in quanteda. I know that several texts in the corpus contain this expression, because when I search for it with 're' in Python it finds them. With quanteda, however, it does not seem to work. Can anyone tell me what I am doing wrong?

> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]

First of all, apologies for not working with the full Chinese text, but I have taken the liberty of inserting your Mandarin phrase into a presidential speech:

data <- "I stand here today humbled by the task before us 抗美 援朝, 
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. 
I thank President Bush for his service to our nation, 
as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans 抗美 援朝 have now taken the presidential oath. 
The words have been spoken during rising tides of prosperity 
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments, 
America has carried on not simply because of the skill or vision of those in high office, 
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers, 
and true to our founding documents."

If you want to use quanteda, what you can do is count 4-grams (I assume your expression consists of four symbols and would therefore be treated as four words).

Step 1: split the text into word tokens:

data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)

Step 2: compute the 4-grams and make a sorted frequency list of them:

fourgrams <- sort(
  table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))),
  decreasing = TRUE
)

You can look at the top ten:

fourgrams[1:10]

                抗 美 援 朝               美 援 朝 have      America has carried on          Americans 抗 美 援 
                          4                           2                           1                           1 
amidst gathering clouds and ancestors I thank President      and cooperation he has        and raging storms At 
                          1                           1                           1                           1 
       and the still waters             and true to our 
                          1                           1 

If you just want to know the frequency of your target compound:

fourgrams["抗 美 援 朝"]
抗 美 援 朝 
         4 
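
The same counts can also be produced inside quanteda itself, for example with dfm() and topfeatures() instead of base table(). This is a small equivalent sketch (note that dfm() lowercases the English features by default, while the Chinese ones are unaffected):

fourgram_dfm <- dfm(tokens_ngrams(data_tokens, n = 4, concatenator = " "))
topfeatures(fourgram_dfm, 10)      # ten most frequent 4-grams
fourgram_dfm[, "抗 美 援 朝"]      # the target compound only; should again give 4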

Or, even simpler, especially if you are really only interested in this single compound, you can use str_extract_all from stringr. This gives you the frequency count right away:

library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
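
If you need the count per document rather than a single total, stringr::str_count() works the same way on a character vector of documents. A small add-on sketch, where docs is a hypothetical named vector standing in for your corpus texts:

library(stringr)
docs <- c(doc1 = data)                 # hypothetical: one document per element
str_count(docs, fixed("抗美 援朝"))    # per-document counts; doc1 gives 4 here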

You are on the right track, but quanteda's default tokenizer appears to split your phrase into four single characters:

> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"

Because the pattern phrase("抗美 援朝") is looking for the token "抗美" followed by the token "援朝", it can never match these single-character tokens, which is why your compound was never created. For these reasons, you should consider an alternative tokenizer. Fortunately, the excellent spaCy Python library offers a way to do this and has Chinese language models. Using the spacyr package together with quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize(), after loading the small Chinese language model.

To count just these expressions, you can use tokens_select() on the tokens and then textstat_frequency() on the dfm.

library("quanteda")
## Package version: 2.1.0

txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath. 
The words have been spoken during rising tides of prosperity 
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments, 
America has carried on not simply because of the skill or vision of those in high office, 
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers, 
and true to our founding documents."

library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

spacy_tokenize(txt) %>%
  as.tokens() %>%
  tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
  tokens_select("抗美 援朝") %>%
  dfm() %>%
  textstat_frequency()
##     feature frequency rank docfreq group
## 1 抗美 援朝         3    1       1   all
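
If installing spaCy is not an option, a workaround that stays with quanteda's default tokenizer is to state the compounding pattern at the character level, so that it matches the character-split tokens shown above. This is only a sketch, under the assumption that your toks_NK object was built with that default tokenizer; a proper word segmenter remains the better choice:

toks_default <- tokens(txt, remove_punct = TRUE)
toks_default <- tokens_compound(toks_default, pattern = phrase("抗 美 援 朝"),
                                concatenator = "")
textstat_frequency(dfm(tokens_select(toks_default, "抗美援朝")))
## the compound should now be found (3 occurrences in txt above)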

In general, it is best to make a dictionary to look up or compound Chinese or Japanese tokens, but the dictionary values should be segmented in the same way as the tokens they are matched against.

require(quanteda)
require(stringi)

txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")

## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)

## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    mwe1 mwe2
##   text1    1    1
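
The same segmented dictionary can also be used for the "compound" branch mentioned above, i.e. joining the matched tokens into one token instead of replacing them with the dictionary keys. A sketch reusing the toks and dict objects from the example, and relying on tokens_compound() accepting a dictionary as its pattern:

toks_comp <- tokens_compound(toks, dict, concatenator = "")
dfm_select(dfm(toks_comp), c("抗美援朝", "向北開進"))
## each compound should appear once in text1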