Count number of words in a Dictionary file in R
I am reading dictionaries into R via the quanteda package. The package comes pre-loaded with some great dictionaries, one of which is the Moral Foundations Dictionary that I am interested in. This dictionary has several categories (harm, fairness, ingroup, etc.), each split into virtue and vice subcategories.
I would like to count the number of words in each subcategory of each foundation in R. How can I do that?
For a reproducible example, the Moral Foundations Dictionary (labelled data_dictionary_MFD) can be accessed by running library(quanteda.dictionaries).
Thanks!
Not sure what your MFD corpus looks like; if it is the one hosted at osf.io/whjt2, then its first six lines look like this (with mfd as the name of the dataset, and Wordtoken and MFDcategory as my column headers):
head(mfd)
Wordtoken MFDcategory
1 compassion 1
2 empathy 1
3 kindness 1
4 caring 1
5 generosity 1
6 benevolence 1
If your goal is simply to find out how many words are listed under each of the ten levels of MFDcategory, then all you have to do is apply table() to that column:
table(mfd$MFDcategory)
1 2 3 4 5 6 7 8 9 10
182 288 115 236 143 49 301 130 272 388
That is, category 1 has 182 word tokens (i.e. care.virtue), category 2 has 288 word tokens (i.e. care.vice), and so on. Does this help?
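For readability you could also attach those labels to the counts. A minimal sketch, assuming the ten numeric codes map onto the categories in the order used by the quanteda dictionary (care.virtue, care.vice, ..., sanctity.vice), which you should verify against your own data:
# label the numeric codes before tabulating (assumed ordering, not verified)
mfd_labels <- c("care.virtue", "care.vice", "fairness.virtue", "fairness.vice",
                "loyalty.virtue", "loyalty.vice", "authority.virtue", "authority.vice",
                "sanctity.virtue", "sanctity.vice")
table(factor(mfd$MFDcategory, levels = 1:10, labels = mfd_labels))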
It is not entirely clear what you are looking for, but it probably comes down to terminology. quanteda dictionaries use the term "keys" for the canonical categories (in R, the names of the list elements) and "values" for the patterns used to match the words that are counted as occurrences of each key.
The MFD has two sets of "keys": the moral "foundations" such as care, fairness, etc., and the "valences", represented by "virtue" and "vice" for each foundation category. However, as we document in quanteda.dictionaries::data_dictionary_MFD -- at least as of v0.22 of quanteda.dictionaries -- the dictionary is flattened to a single level.
We can see this, and count the values in each dictionary "key" combining foundation and valence, as follows:
library("quanteda")
## Package version: 1.5.2
data(data_dictionary_MFD, package = "quanteda.dictionaries")
# number of "words" in each MFD dictionary key
lengths(data_dictionary_MFD)
## care.virtue care.vice fairness.virtue fairness.vice
## 182 288 115 236
## loyalty.virtue loyalty.vice authority.virtue authority.vice
## 142 49 301 130
## sanctity.virtue sanctity.vice
## 272 388
# first 5 values in each dictionary key
lapply(data_dictionary_MFD, head, 5)
## $care.virtue
## [1] "alleviate" "alleviated" "alleviates" "alleviating" "alleviation"
##
## $care.vice
## [1] "abused" "abuser" "abusers" "abuses" "abusing"
##
## $fairness.virtue
## [1] "avenge" "avenged" "avenger" "avengers" "avenges"
##
## $fairness.vice
## [1] "am partial" "bamboozle" "bamboozled" "bamboozles" "bamboozling"
##
## $loyalty.virtue
## [1] "all for one" "allegiance" "allegiances" "allegiant" "allied"
##
## $loyalty.vice
## [1] "against us" "apostate" "apostates" "backstab" "backstabbed"
##
## $authority.virtue
## [1] "acquiesce" "acquiesced" "acquiescent" "acquiesces" "acquiescing"
##
## $authority.vice
## [1] "anarchist" "anarchistic" "anarchists" "anarchy" "apostate"
##
## $sanctity.virtue
## [1] "abstinance" "abstinence" "allah" "almighty" "angel"
##
## $sanctity.vice
## [1] "abhor" "abhored" "abhors" "addict" "addicted"
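In other words, lengths() already gives the per-subcategory word counts asked for in the question; if you also want the total number of entries in the dictionary, you can simply sum them (a one-line sketch):
# total number of values across all ten keys
sum(lengths(data_dictionary_MFD))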
To apply this to count the words matching each "key" (the combination of foundation and valence), we can create a dfm and then use dfm_lookup():
# number of words in a text matching the MFD dictionary
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFD) %>%
tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
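As an aside, if you prefer these dictionary counts as an ordinary data frame for further analysis, quanteda's convert() can reshape the dfm; a quick sketch:
# optional: convert the dictionary counts into a plain data.frame
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFD) %>%
  convert(to = "data.frame") %>%
  tail()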
But there is a better way that exploits the nested structure of the MFD, although we first need to modify the dictionary object to make it nested. As supplied, the MFD is already "flattened". We want to unflatten it, so that the foundations form the first-level keys and the valences form the second-level keys. Then, using the levels argument of tokens_lookup() and dfm_lookup(), we can choose the level at which matches are counted in our texts.
First, recreate the dictionary so that it is nested.
# remake the dictionary into nested categories of foundation and valence
data_dictionary_MFDnested <-
dictionary(list(
care = list(
virtue = data_dictionary_MFD[["care.virtue"]],
vice = data_dictionary_MFD[["care.vice"]]
),
fairness = list(
virtue = data_dictionary_MFD[["fairness.virtue"]],
vice = data_dictionary_MFD[["fairness.vice"]]
),
loyalty = list(
virtue = data_dictionary_MFD[["loyalty.virtue"]],
vice = data_dictionary_MFD[["loyalty.vice"]]
),
authority = list(
virtue = data_dictionary_MFD[["authority.virtue"]],
vice = data_dictionary_MFD[["authority.vice"]]
),
sanctity = list(
virtue = data_dictionary_MFD[["sanctity.virtue"]],
vice = data_dictionary_MFD[["sanctity.vice"]]
)
))
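Alternatively, because every key follows the foundation.valence naming pattern, the same nested dictionary can be built programmatically rather than by hand. A sketch, assuming that naming convention holds for all ten keys:
# build the nested structure by splitting each "foundation.valence" key name
keys <- names(data_dictionary_MFD)
foundations <- unique(sub("\\..*$", "", keys))
nested_list <- lapply(foundations, function(f) {
  list(virtue = data_dictionary_MFD[[paste0(f, ".virtue")]],
       vice   = data_dictionary_MFD[[paste0(f, ".vice")]])
})
names(nested_list) <- foundations
data_dictionary_MFDnested2 <- dictionary(nested_list)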
Examining this, we can see the details of the dictionary:
lengths(data_dictionary_MFDnested)
## care fairness loyalty authority sanctity
## 2 2 2 2 2
lapply(data_dictionary_MFDnested, lengths)
## $care
## virtue vice
## 182 288
##
## $fairness
## virtue vice
## 115 236
##
## $loyalty
## virtue vice
## 142 49
##
## $authority
## virtue vice
## 301 130
##
## $sanctity
## virtue vice
## 272 388
Now we can apply it to our texts:
# now apply it to texts
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
tail()
## Document-feature matrix of: 6 documents, 5 features (0.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs care fairness loyalty authority sanctity
## 1997-Clinton 12 8 37 3 22
## 2001-Bush 29 12 37 20 22
## 2005-Bush 26 20 41 37 16
## 2009-Obama 24 9 34 20 21
## 2013-Obama 20 17 41 12 14
## 2017-Trump 23 6 44 21 16
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 2) %>%
tail()
## Document-feature matrix of: 6 documents, 2 features (0.0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
## features
## docs virtue vice
## 1997-Clinton 68 14
## 2001-Bush 107 13
## 2005-Bush 117 23
## 2009-Obama 95 13
## 2013-Obama 94 10
## 2017-Trump 95 15
Specifying both levels (or the default levels = 1:5) matches what we got originally with the flattened dictionary:
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1:2) %>%
tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
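One final note: the output above was produced with quanteda 1.5.2. In more recent releases of quanteda (v3 and later), building a dfm directly from a corpus is deprecated, so the equivalent pipeline would tokenise explicitly first; the dfm_lookup() step is unchanged. A sketch:
# with quanteda >= 3, tokenise before constructing the dfm
tokens(data_corpus_inaugural) %>%
  dfm() %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
  tail()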