Count number of words in a Dictionary file in R

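I am reading a dictionary into R via the quanteda package. The package comes pre-loaded with some great dictionaries, one of which is the Moral Foundations Dictionary that I am interested in. This dictionary has several categories (harm, fairness, ingroup, etc.), each broken down into virtue and vice subcategories.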

I would like to count the number of words in each subcategory of each foundation in R. How would I go about doing this?

For a reproducible example, I can access the Moral Foundations Dictionary (labelled data_dictionary_MFD) by running library(quanteda.dictionaries).

Thanks!

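I am not sure what your MFD corpus looks like; if it is the one hosted at osf.io/whjt2, then the first six rows look like this (mfd is my name for the dataset, and Wordtoken and MFDcategory are my column headers):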

head(mfd)
    Wordtoken MFDcategory
1  compassion           1
2     empathy           1
3    kindness           1
4      caring           1
5  generosity           1
6 benevolence           1

If your goal is just to find out how many words are listed under each of the ten levels of MFDcategory, then all you have to do is use table on that column:

table(mfd$MFDcategory)

  1   2   3   4   5   6   7   8   9  10 
182 288 115 236 143  49 301 130 272 388

That is, category 1, which is care.virtue, has 182 word tokens, category 2, which is care.vice, has 288 word tokens, and so on. Does this help?
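To make the numeric codes readable, they can be turned into a labelled factor before tabulating. The following is only a sketch on a toy data frame: the column names come from the head(mfd) output above, the last two rows are invented for illustration, and the full mapping of codes 1 to 10 onto keys is an assumption (only codes 1 and 2 are confirmed above).

```r
# Toy stand-in for the real file; the last two rows are invented for illustration.
mfd <- data.frame(
  Wordtoken   = c("compassion", "empathy", "kindness", "harm", "suffer"),
  MFDcategory = c(1, 1, 1, 2, 2)
)

# ASSUMPTION: codes 1..10 follow this order (only 1 and 2 are confirmed above).
foundations <- c("care", "fairness", "loyalty", "authority", "sanctity")
key_labels  <- paste(rep(foundations, each = 2), c("virtue", "vice"), sep = ".")

# Attach a readable key to every row, then count rows per key.
mfd$key <- factor(mfd$MFDcategory, levels = 1:10, labels = key_labels)
counts  <- table(mfd$key)

# Show only the keys that occur in the toy data.
counts[counts > 0]
```

If the real coding differs from the assumed order, only key_labels needs to change.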

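It is not entirely clear what you are looking for, but this may come down to terminology. quanteda dictionaries use the term "keys" for the canonical categories (in R, the names of the list elements) and "values" for the patterns that are matched against words in order to count occurrences of each key.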

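The MFD has two sets of "keys": the moral "foundations" such as care, fairness, etc., and the "valences", represented by "vice" and "virtue" for each foundation category. However, as we have recorded it in quanteda.dictionaries::data_dictionary_MFD (at least in v0.22 of quanteda.dictionaries), the dictionary is flattened to just one level.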

We can see this by counting the values in each dictionary "key", which here combines foundation and valence, as follows:

library("quanteda")
## Package version: 1.5.2

data(data_dictionary_MFD, package = "quanteda.dictionaries")

# number of "words" in each MFD dictionary key
lengths(data_dictionary_MFD)
##      care.virtue        care.vice  fairness.virtue    fairness.vice 
##              182              288              115              236 
##   loyalty.virtue     loyalty.vice authority.virtue   authority.vice 
##              142               49              301              130 
##  sanctity.virtue    sanctity.vice 
##              272              388

# first 5 values in each dictionary key
lapply(data_dictionary_MFD, head, 5)
## $care.virtue
## [1] "alleviate"   "alleviated"  "alleviates"  "alleviating" "alleviation"
## 
## $care.vice
## [1] "abused"  "abuser"  "abusers" "abuses"  "abusing"
## 
## $fairness.virtue
## [1] "avenge"   "avenged"  "avenger"  "avengers" "avenges" 
## 
## $fairness.vice
## [1] "am partial"  "bamboozle"   "bamboozled"  "bamboozles"  "bamboozling"
## 
## $loyalty.virtue
## [1] "all for one" "allegiance"  "allegiances" "allegiant"   "allied"     
## 
## $loyalty.vice
## [1] "against us"  "apostate"    "apostates"   "backstab"    "backstabbed"
## 
## $authority.virtue
## [1] "acquiesce"   "acquiesced"  "acquiescent" "acquiesces"  "acquiescing"
## 
## $authority.vice
## [1] "anarchist"   "anarchistic" "anarchists"  "anarchy"     "apostate"   
## 
## $sanctity.virtue
## [1] "abstinance" "abstinence" "allah"      "almighty"   "angel"     
## 
## $sanctity.vice
## [1] "abhor"    "abhored"  "abhors"   "addict"   "addicted"
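Because each flat key has the form foundation.valence, the per-key counts can also be arranged as a foundation-by-valence table. This is a minimal base R sketch on a toy named list (a few values copied from the output above); it assumes the real object can be passed to names() and lengths() in the same way, which the lengths() call above suggests.

```r
# Toy stand-in for data_dictionary_MFD: a named list of character vectors.
flat <- list(
  care.virtue     = c("alleviate", "alleviated", "alleviates"),
  care.vice       = c("abused", "abuser"),
  fairness.virtue = c("avenge", "avenged"),
  fairness.vice   = c("am partial", "bamboozle", "bamboozled", "bamboozles")
)

# Number of values per key, then split each key name at the dot.
n     <- lengths(flat)
parts <- do.call(rbind, strsplit(names(n), ".", fixed = TRUE))

counts <- data.frame(
  foundation = parts[, 1],
  valence    = parts[, 2],
  n          = unname(n)
)

# Cross-tabulate: one row per foundation, one column per valence.
wide <- xtabs(n ~ foundation + valence, data = counts)
wide
```

rowSums(wide) then gives the number of values per foundation regardless of valence.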

To apply this to count the words in a text that match each "key" (a combination of foundation and valence), we can create a dfm and then use dfm_lookup():

# number of words in a text matching the MFD dictionary
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFD) %>%
  tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
##               features
## docs           care.virtue care.vice fairness.virtue fairness.vice
##   1997-Clinton           8         4               6             2
##   2001-Bush             21         8              11             1
##   2005-Bush             14        12              16             4
##   2009-Obama            18         6               8             1
##   2013-Obama            14         6              15             2
##   2017-Trump            16         7               2             4
##               features
## docs           loyalty.virtue loyalty.vice authority.virtue authority.vice
##   1997-Clinton             37            0                3              0
##   2001-Bush                36            1               18              2
##   2005-Bush                38            3               33              4
##   2009-Obama               33            1               18              2
##   2013-Obama               39            2               12              0
##   2017-Trump               44            0               20              1
##               features
## docs           sanctity.virtue sanctity.vice
##   1997-Clinton              14             8
##   2001-Bush                 21             1
##   2005-Bush                 16             0
##   2009-Obama                18             3
##   2013-Obama                14             0
##   2017-Trump                13             3

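There is a better way that makes use of the nested structure of the MFD, but we first need to modify the dictionary object to make it nested. As supplied, the MFD is already "flattened". We want to unflatten it so that the foundations form the first-level keys and the valences form the second-level keys. Then, using the levels argument in tokens_lookup() and dfm_lookup(), we will be able to choose the level at which we count matches in the texts.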

First, remake the dictionary so that it is nested.

# remake the dictionary into nested categories of foundation and valence
data_dictionary_MFDnested <-
  dictionary(list(
    care = list(
      virtue = data_dictionary_MFD[["care.virtue"]],
      vice = data_dictionary_MFD[["care.vice"]]
    ),
    fairness = list(
      virtue = data_dictionary_MFD[["fairness.virtue"]],
      vice = data_dictionary_MFD[["fairness.vice"]]
    ),
    loyalty = list(
      virtue = data_dictionary_MFD[["loyalty.virtue"]],
      vice = data_dictionary_MFD[["loyalty.vice"]]
    ),
    authority = list(
      virtue = data_dictionary_MFD[["authority.virtue"]],
      vice = data_dictionary_MFD[["authority.vice"]]
    ),
    sanctity = list(
      virtue = data_dictionary_MFD[["sanctity.virtue"]],
      vice = data_dictionary_MFD[["sanctity.vice"]]
    )
  ))
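The same nesting could be built programmatically rather than by hand, by splitting the flat key names at the dot. The sketch below works on a toy plain list, not on the real dictionary object; the names flat and nested are illustrative, and how best to obtain a plain named list from data_dictionary_MFD may depend on the quanteda version.

```r
# Toy stand-in for the flat dictionary: a plain named list of character vectors.
flat <- list(
  care.virtue     = c("alleviate", "alleviated"),
  care.vice       = c("abused", "abuser", "abusers"),
  fairness.virtue = c("avenge"),
  fairness.vice   = c("bamboozle", "bamboozled")
)

# Split each key name into its foundation and valence parts.
parts      <- strsplit(names(flat), ".", fixed = TRUE)
foundation <- vapply(parts, `[`, character(1), 1)
valence    <- vapply(parts, `[`, character(1), 2)

# Group element positions by foundation (keeping the original order),
# then name each group's elements by valence.
groups <- split(seq_along(flat), factor(foundation, levels = unique(foundation)))
nested <- lapply(groups, function(idx) setNames(flat[idx], valence[idx]))

str(nested)
```

The resulting nested list has the same shape as the list passed to dictionary() above, so it could be handed to dictionary() in the same way.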

Examining this, we can see the details of the dictionary:

lengths(data_dictionary_MFDnested)
##      care  fairness   loyalty authority  sanctity 
##         2         2         2         2         2
lapply(data_dictionary_MFDnested, lengths)
## $care
## virtue   vice 
##    182    288 
## 
## $fairness
## virtue   vice 
##    115    236 
## 
## $loyalty
## virtue   vice 
##    142     49 
## 
## $authority
## virtue   vice 
##    301    130 
## 
## $sanctity
## virtue   vice 
##    272    388

Now we can apply it to our texts:

# now apply it to texts
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
  tail()
## Document-feature matrix of: 6 documents, 5 features (0.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
##               features
## docs           care fairness loyalty authority sanctity
##   1997-Clinton   12        8      37         3       22
##   2001-Bush      29       12      37        20       22
##   2005-Bush      26       20      41        37       16
##   2009-Obama     24        9      34        20       21
##   2013-Obama     20       17      41        12       14
##   2017-Trump     23        6      44        21       16

dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 2) %>%
  tail()
## Document-feature matrix of: 6 documents, 2 features (0.0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
##               features
## docs           virtue vice
##   1997-Clinton     68   14
##   2001-Bush       107   13
##   2005-Bush       117   23
##   2009-Obama       95   13
##   2013-Obama       94   10
##   2017-Trump       95   15

Specifying both levels (or leaving the default, levels = 1:5) matches what we originally got with the flattened dictionary:

dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1:2) %>%
  tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
##               features
## docs           care.virtue care.vice fairness.virtue fairness.vice
##   1997-Clinton           8         4               6             2
##   2001-Bush             21         8              11             1
##   2005-Bush             14        12              16             4
##   2009-Obama            18         6               8             1
##   2013-Obama            14         6              15             2
##   2017-Trump            16         7               2             4
##               features
## docs           loyalty.virtue loyalty.vice authority.virtue authority.vice
##   1997-Clinton             37            0                3              0
##   2001-Bush                36            1               18              2
##   2005-Bush                38            3               33              4
##   2009-Obama               33            1               18              2
##   2013-Obama               39            2               12              0
##   2017-Trump               44            0               20              1
##               features
## docs           sanctity.virtue sanctity.vice
##   1997-Clinton              14             8
##   2001-Bush                 21             1
##   2005-Bush                 16             0
##   2009-Obama                18             3
##   2013-Obama                14             0
##   2017-Trump                13             3
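As a sanity check on what the levels argument is doing, the level-1 and level-2 counts shown earlier can be reproduced by summing the flat columns by foundation or by valence. This is a base R sketch using the first two rows of the output above, copied into an ordinary matrix; it illustrates the arithmetic only and is not how quanteda computes the result internally.

```r
# Counts for two documents, copied from the flat (10-key) output above.
flat_counts <- matrix(
  c( 8, 4,  6, 2, 37, 0,  3, 0, 14, 8,
    21, 8, 11, 1, 36, 1, 18, 2, 21, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1997-Clinton", "2001-Bush"),
    c("care.virtue", "care.vice", "fairness.virtue", "fairness.vice",
      "loyalty.virtue", "loyalty.vice", "authority.virtue", "authority.vice",
      "sanctity.virtue", "sanctity.vice")
  )
)

# Split the column names into foundation (part 1) and valence (part 2).
parts <- do.call(rbind, strsplit(colnames(flat_counts), ".", fixed = TRUE))

# rowsum() adds up rows by group, so transpose, group, and transpose back.
by_foundation <- t(rowsum(t(flat_counts), group = parts[, 1], reorder = FALSE))
by_valence    <- t(rowsum(t(flat_counts), group = parts[, 2], reorder = FALSE))

by_foundation
by_valence
```

For 1997-Clinton, for example, care is 8 + 4 = 12 and virtue is 8 + 6 + 37 + 3 + 14 = 68, which agrees with the levels = 1 and levels = 2 results.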