Computing relative frequencies based on dictionary

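I want to examine the psychological capital of founders (a construct consisting of four dimensions, namely hope, optimism, efficacy and resiliency) using computer-aided text analysis in R. So far I have pulled tweets from various Twitter users into R. The data frame contains 2130 tweets from 5 different users at different points in time. The data frame is called before_failure. (Picture of original data frame)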

Then I created a corpus using the quanteda package, tokenized it, and removed redundant punctuation/numbers/symbols:

library("quanteda")

# Creating a corpus
before_failure_corpus <- corpus(before_failure, text_field = "text")

# Tokenization, removing punctuation, numbers and symbols, then lowercasing
tok_before_failure <- before_failure_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower()

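After that I created a dictionary, also with the quanteda package (the dictionary itself was created by other authors who study psychological capital):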


#Creating Dictionary with quanteda
dict <- dictionary(list(hope = c("Accomplishments", "Achievements", "Approach", "Aspiration", "Aspire", "Aspired",
                                 "Aspirer", "Aspires", "Aspiring", "Aspiringly", "Assurance", "Assurances", "Assure",
                                 "Assured", "Assuredly", "Assuredness", "Assuring", "Assuringly", "Assuringness", "Belief",
                                 "Believe", "Believed", "Believes", "Believing", "Breakthrough", "Certain", "Certainly",
                                 "Certainty", "Committed", "Concept", "Confidence", "Confident", "Confidently",
                                 "Convinced", "Dare say", "Deduce", "Deduced", "Deduces", "Deducing", "Desire",
                                 "Desired", "Desires", "Desiring", "Doubt not", "Energy", "Engage", "Engagement",
                                 "Expectancy", "Faith", "Foresaw", "Foresee", "Foreseeing", "Foreseen", "Foresees", "Goal",
                                 "Goals", "Hearten", "Heartened", "Heartening", "Hearteningly", "Heartens", "Hope",
                                 "Hoped", "Hopeful", "Hopefully", "Hopefulness", "Hoper", "Hopes", "Hoping", "Idea",
                                 "Innovation", "Innovative", "Ongoing", "Opportunity", "Promise", "Promising",
                                 "Propitious", "Propitiously", "Propitiousness", "Solution", "Solutions", "Upbeat",
                                 "Wishes", "Wishing", "Yearn", "Yearn for", "Yearning", "Yearning for", "Yearns for"),
                       efficacy = c("Ability", "Accomplish", "Accomplished", "Accomplishes", "Accomplishing",
                                    "Accomplishments", "Achievements", "Achieving", "Adept", "Adeptly", "Adeptness",
                                    "Adroitly", "Adroitness", "All-in", "Aplomb", "Arrogance", "Arrogant", "Arrogantly",
                                    "Assurance", "Assured", "Assuredly", "Assuredness", "Backbone", "Bandwidth", "Belief",
                                    "Capable", "Capableness", "Capably", "Certain", "Certainly", "Certainness", "Certainty",
                                    "Certitude", "Cocksurely", "Cocksureness", "Cocky", "Commitment", "Commitments",
                                    "Committed", "Compelling", "Competence", "Competency", "Competent", "Competently",
                                    "Confidence", "Confident", "Confidently", "Conviction", "Effective", "Effectively",
                                    "Effectiveness", "Effectual", "Effectually", "Effectualness", "Efficacious", "Efficaciously",
                                    "Efficaciousness", "Efficacy", "Equanimity", "Equanimous", "Equanimously", "Expertise",
                                    "Expertly", "Fortitude", "Fortitudinous", "Forward", "Forwardness", "Know-how",
                                    "Knowledgability", "Knowledgeable", "Knowledgably", "Masterful", "Masterfully", "Masterfulness",
                                    "Masterly", "Mastery", "Overconfidence", "Overconfident", "Overconfidently",
                                    "Persuasion", "Power", "Powerful", "Powerfully", "Powerfulness", "Prevailed",
                                    "Prevailing", "Prevails", "Prevalence", "Prevalent", "Reassurance", "Reassure", "Reassured",
                                    "Reassures", "Reassuring", "Self-assurance", "Self-assured", "Self-assuring", "Self-confidence",
                                    "Self-confident", "Self-dependence", "Self-dependent", "Self-reliance",
                                    "Self-reliant", "Stamina", "Steadily", "Steadiness", "Steady", "Strength", "Strong", "Stronger",
                                    "Strongish", "Strongly", "Strongness", "Superior", "Superiority", "Sure", "Surely", "Sureness",
                                    "Unblinking", "Unblinkingly", "Undoubtedly", "Undoubting", "Unflappability", "Unflappable",
                                    "Unflinching", "Unflinchingly", "Unhesitating", "Unhesitatingly", "Unwavering",
                                    "Unwaveringly"),
                       resiliency = c("Adamant", "Adamantly", "Assiduous", "Assiduously", "Assiduousness", "Backbone",
                                      "Bandwidth", "Bears up", "Bounce", "Bounced", "Bounces", "Bouncing", "Buoyant",
                                      "Commitment", "Commitments", "Committed", "Consistent", "Determination",
                                      "Determined", "Determinedly", "Determinedness", "Devoted", "Devotedly",
                                      "Devotedness", "Devotion", "Die trying", "Died trying", "Dies trying", "Disciplined",
                                      "Dogged", "Doggedly", "Doggedness", "Drudge", "Drudged", "Drudges", "Endurance",
                                      "Endure", "Endured", "Endures", "Enduring", "Grit", "Hammer away", "Hammered away",
                                      "Hammering away", "Hammers away", "Held fast", "Held good", "Held up", "Hold fast",
                                      "Holding fast", "Holding up", "Holds fast", "Holds good", "Immovability", "Immovable",
                                      "Immovably", "Indefatigable", "Indefatigableness", "Indefatigably", "Indestructibility",
                                      "Indestructible", "Indestructibleness", "Indestructibly", "Intransigence", "Intransigency",
                                      "Intransigent", "Keep at", "Keep going", "Keep on", "Keeping at", "Keeping going",
                                      "Keeping on", "Keeps at", "Keeps going", "Keeps on", "Kept at", "Kept going", "Kept on",
                                      "Labored", "Laboring", "Never-tiring", "Never-wearying", "Perdure", "Perdured", "Perduring",
                                      "Perseverance", "Persevere", "Persevered", "Persevering", "Persist", "Persisted",
                                      "Persistence", "Persistent", "Persisting", "Pertinacious", "Pertinaciously", "Pertinacity",
                                      "Rebound", "Rebounded", "Rebounding", "Rebounds", "Relentlessness", "Remain",
                                      "Remained", "Remaining", "Remains", "Resilience", "Resiliency", "Resilient", "Resolute",
                                      "Resolutely", "Resoluteness", "Resolve", "Resolved", "Resolves", "Resolving", "Robust",
                                      "Sedulity", "Sedulous", "Sedulously", "Sedulousness", "Snap back", "Snapped back",
                                      "Snapping back", "Snaps back", "Spring back", "Springing back", "Springs", "Springs back",
                                      "Sprung back", "Stalwart", "Stalwartly", "Stalwartness", "Stand fast", "Stand firm", "Standing fast",
                                      "Standing firm", "Stands fast", "Stands firm", "Stay", "Steadfast", "Steadfastly",
                                      "Steadfastness", "Stood fast", "Stood firm", "Strove", "Survive", "Surviving", "Surviving",
                                      "Tenacious", "Tenaciously", "Tenaciousness", "Tenacity", "Tough", "Uncompromising",
                                      "Uncompromisingly", "Uncompromisingness", "Unfaltering", "Unfalteringly", "Unflagging",
                                      "Unrelenting", "Unrelentingly", "Unrelentingness", "Unshakable", "Unshakablely",
                                      "Unshakeable", "Unshaken", "Unshaking", "Unswervable", "Unswerved", "Unswerving",
                                      "Unswervingly", "Unswervingness", "Untiring", "Unwavered", "Unwavering", "Unweariedness",
                                      "Unyielding", "Unyieldingly", "Unyieldingness", "Upheld", "Uphold", "Upholding",
                                      "Upholds", "Zeal", "Zealous", "Zealously", "Zealousness"),
                       optimism = c("Aspire", "Aspirer", "Aspires", "Aspiring", "Aspiringly", "Assurance", "Assured", "Assuredly",
                                    "Assuredness", "Assuring", "Auspicious", "Auspiciously", "Auspiciousness", "Bank on",
                                    "Beamish", "Believe", "Believed", "Believes", "Believing", "Bullish", "Bullishly", "Bullishness",
                                    "Confidence", "Confident", "Confidently", "Encourage", "Encouraged", "Encourages",
                                    "Encouraging", "Encouragingly", "Ensuring", "Expectancy", "Expectant", "Expectation",
                                    "Expectations", "Expected", "Expecting", "Faith", "Good omen", "Hearten", "Heartened",
                                    "Heartener", "Heartening", "Hearteningly", "Heartens", "Hope", "Hoped", "Hopeful",
                                    "Hopefully", "Hopefulness", "Hoper", "Hopes", "Hoping", "Ideal", "Idealist", "Idealistic",
                                    "Idealistically", "Ideally", "Looking up", "Looks up", "Optimism", "Optimist", "Optimistic",
                                    "Optimistical", "Optimistically", "Outlook", "Positive", "Positively", "Positiveness",
                                    "Positivity", "Promising", "Propitious", "Propitiously", "Propitiousness", "Reassure",
                                    "Reassured", "Reassures", "Reassuring", "Roseate", "Rosy", "Sanguine", "Sanguinely",
                                    "Sanguineness", "Sanguinity", "Sunniness", "Sunny")))

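Now I want to compute relative frequencies by dividing the number of words in the tweets that reflect the four dimensions of PsyCap by the total number of words in the corpus. Unfortunately, I am stuck at this point. In the end I would like a table that looks like this (the values are made up):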

  dimensions Frequency
1       hope      0.36
2   optimism      0.50
3   efficacy      0.22
4 resiliency      0.10

I hope my explanation is sufficient; if not, please do not hesitate to ask. Thanks.

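The easiest way is to use tokens_lookup() with a category for the non-matched tokens, then compile the result into a dfm, and then convert that to within-document term proportions.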

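To give a reproducible example using the built-in quanteda objects, the process would be as follows. (You can substitute your own corpus and dictionary, and the code should work fine.)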

library("quanteda")
## Package version: 3.2
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

tok_before_failure <- tokens(tail(data_corpus_inaugural, 5))
dict <- data_dictionary_LSD2015[1:2]

tokens_lookup(tok_before_failure, dict, nomatch = "other") %>%
  dfm() %>%
  dfm_weight(scheme = "prop")
## Document-feature matrix of: 5 documents, 3 features (0.00% sparse) and 4 docvars.
##             features
## docs           negative   positive     other
##   2005-Bush  0.03719723 0.09169550 0.8711073
##   2009-Obama 0.04428731 0.07182732 0.8838854
##   2013-Obama 0.03366422 0.07337074 0.8929650
##   2017-Trump 0.02831325 0.07409639 0.8975904
##   2021-Biden 0.04049168 0.06182213 0.8976862
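
The output above gives proportions within each document, while the question asks for one figure per dimension over the whole corpus. A minimal sketch of that variant follows, assuming it is acceptable to pool all documents and to use the token count before the lookup as the denominator. The names toks, dfmat, counts and rel_freq are illustrative and not from the original post. The denominator comes from ntoken() rather than from the lookup result because, as far as I understand, a token that matches several dictionary categories is counted under each of them, which would inflate a total taken after the lookup.

```r
library("quanteda")

# Same reproducible objects as in the example above
toks <- tokens(tail(data_corpus_inaugural, 5))
dict <- data_dictionary_LSD2015[1:2]

# Count dictionary matches per category in each document
dfmat <- dfm(tokens_lookup(toks, dict))

# Pool the counts over all documents (one total per category)
counts <- featfreq(dfmat)

# Divide by the total number of tokens in the whole corpus
rel_freq <- data.frame(
  dimensions = names(counts),
  Frequency  = as.numeric(counts) / sum(ntoken(toks)),
  row.names  = NULL
)
rel_freq
```

With the objects from the question, tok_before_failure and the four-category dict would take the place of toks and dict. Those tokens already have punctuation, numbers and symbols removed, so the denominator would count words only. By default tokens_lookup() matches case-insensitively and treats dictionary values containing a space as multi-word sequences, so capitalised entries and phrases such as "Dare say" should not need special handling.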