Why is "machine_learning" lemmatized both as "machine_learning" and "machine_learne"?
I ran LDA over a number of texts. When I did some visualization of the resulting topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and "machine_learne". Here is the smallest reproducible example I can provide:
import en_core_web_sm
tokenized = [
[
'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
'industry', 'discover', 'emerging', 'trends', 'latest_developments',
'ai', 'machine_learning', 'industry', 'players', 'trading',
'investing', 'live', 'investment', 'models', 'learn', 'develop',
'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
'talents', 'including', 'quants', 'data_scientists', 'researchers',
'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
'adopting', 'ai', 'machine_learning'
],
[
'recent_years', 'topics', 'data_science', 'artificial_intelligence',
'machine_learning', 'big_data', 'become_increasingly', 'popular',
'growth', 'fueled', 'collection', 'availability', 'data',
'continually', 'increasing', 'processing', 'power', 'storage', 'open',
'source', 'movement', 'making', 'tools', 'widely', 'available',
'result', 'already', 'witnessed', 'profound', 'changes', 'work',
'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
'investment', 'managers', 'particular', 'join_us', 'explore',
'data_science', 'means', 'finance_professionals'
]
]
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips
lemmatized = lemmatization(tokenized)
print(lemmatized)
You'll notice that "machine_learne" is nowhere to be found in the input tokenized, but both "machine_learning" and "machine_learne" appear in the output lemmatized.
What is the cause of this, and can I expect it to cause problems with other bigrams/trigrams?
I think you are misunderstanding how POS tagging and lemmatization work.
POS tagging is based on several other pieces of information besides the word alone (I don't know which is your mother language, but this is common to many languages), and also on the surrounding words (for example, one commonly learned rule is that in many statements a verb is usually preceded by a noun, which represents the verb's agent).
When you pass all these 'tokens' to your lemmatizer, spaCy's lemmatizer will try to "guess" the part of speech of each word in isolation.
In many cases it goes with a default noun and, if the word is not in a lookup table of common and irregular nouns, it tries generic rules (such as stripping the plural 's').
In other cases it goes with a default verb based on some patterns (the "-ing" at the end), which is probably your case. Since the verb "machine_learning" does not exist in any dictionary (there is no instance of it in its model), it goes the "else" route and applies generic rules.
Therefore, machine_learning is probably being lemmatized by a generic '"ing" to "e"' rule (as in making -> make, baking -> bake), common to many regular verbs.
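That fallback behavior can be sketched in a few lines. This is a simplified stand-in for spaCy's actual lookup tables and English suffix rules, not its real implementation; the rule list and lookup dict here are illustrative assumptions:

```python
# Simplified sketch of a rule-based verb lemmatizer fallback (not spaCy's code).
VERB_RULES = [
    ("ing", "e"),  # generic '"ing" to "e"' rule: making -> make, baking -> bake
    ("ed", ""),    # walked -> walk
    ("s", ""),     # runs -> run
]

def lemmatize_verb(word, lookup=None):
    """Return a table lemma if the form is known (irregulars), otherwise the
    first candidate from a matching suffix rule, otherwise the word itself."""
    lookup = lookup or {"was": "be", "went": "go"}  # illustrative irregulars
    if word in lookup:
        return lookup[word]
    for old, new in VERB_RULES:
        if word.endswith(old):
            return word[: -len(old)] + new  # generic suffix rewrite
    return word

# An unseen '-ing' token goes the "else" route and hits the generic rule:
print(lemmatize_verb("machine_learning"))  # -> machine_learne
print(lemmatize_verb("making"))            # -> make
```

Once a tagger has (wrongly) labeled "machine_learning" as a VERB, a rule set like this produces exactly the "machine_learne" you observed.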
See this test case:
for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])
Output:
[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB',
'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN',
'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ',
'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN',
'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'),
('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN',
'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'),
('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'),
('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN',
'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'),
('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'),
('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'),
('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ',
'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN',
'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'),
('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN',
'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN',
'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN',
'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ',
'machine_learning'), ('NOUN', 'experts'), ('NOUN',
'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'),
('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'),
('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN',
'machine_learning')]
Depending on the context, machine_learning is being used both as a verb and as a noun. But notice that just concatenating the words gets you a mess, because they are not ordered as expected in natural language.
Not even a human could understand and correctly POS-tag this text:
artificially_intelligent funds generating excess returns
artificial_intelligence deep_learning compelling reasons join_us
artificially_intelligent fund develop ai machine_learning capabilities
real cases big players industry discover emerging trends
latest_developments ai machine_learning industry players trading
investing live investment models learn develop compelling business
case clients ceos adopt ai machine_learning investment approaches rare
gathering talents including quants data_scientists researchers ai
machine_learning experts investment_officers explore solutions
challenges potential risks pitfalls adopting ai machine_learning
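So yes, you can expect the same problem with other joined bigrams/trigrams the model has never seen. One pragmatic workaround is to keep any '_'-joined n-gram verbatim and only trust the lemma for ordinary words. The helper below (`safe_lemma`) is a hypothetical sketch, not part of spaCy's API:

```python
def safe_lemma(text, lemma):
    """Hypothetical helper: keep '_'-joined n-grams as-is, since the
    rule-based fallback is unreliable for tokens the model never saw."""
    return text if "_" in text else lemma

# In the lemmatization loop you would call safe_lemma(token.text, token.lemma_):
print(safe_lemma("machine_learning", "machine_learne"))  # -> machine_learning
print(safe_lemma("funds", "fund"))                       # -> fund
```

This keeps the joined phrases stable across documents, which is usually what you want when feeding them to LDA.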