Why is "machine_learning" lemmatized both as "machine_learning" and "machine_learne"?
I ran LDA over a number of texts. When I did some visualization of the resulting topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and "machine_learne". Here is the smallest reproducible example I can provide:
import en_core_web_sm
tokenized = [
[
'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
'industry', 'discover', 'emerging', 'trends', 'latest_developments',
'ai', 'machine_learning', 'industry', 'players', 'trading',
'investing', 'live', 'investment', 'models', 'learn', 'develop',
'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
'talents', 'including', 'quants', 'data_scientists', 'researchers',
'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
'adopting', 'ai', 'machine_learning'
],
[
'recent_years', 'topics', 'data_science', 'artificial_intelligence',
'machine_learning', 'big_data', 'become_increasingly', 'popular',
'growth', 'fueled', 'collection', 'availability', 'data',
'continually', 'increasing', 'processing', 'power', 'storage', 'open',
'source', 'movement', 'making', 'tools', 'widely', 'available',
'result', 'already', 'witnessed', 'profound', 'changes', 'work',
'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
'investment', 'managers', 'particular', 'join_us', 'explore',
'data_science', 'means', 'finance_professionals'
]
]
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips
lemmatized = lemmatization(tokenized)
print(lemmatized)
You'll notice that "machine_learne" is nowhere to be found in the input tokenized, but both "machine_learning" and "machine_learne" appear in the output lemmatized.
What is the cause of this, and can I expect it to cause problems with other bigrams/trigrams?
I think you are misunderstanding how POS tagging and lemmatization work.
POS tagging is based on several other pieces of information besides the word alone (I don't know which is your mother language, but this is common to many languages), and also on the surrounding words (for example, one commonly learned rule is that in many statements a verb is usually preceded by a noun, which represents the verb's agent).
When you pass all these 'tokens' to your lemmatizer, spaCy's lemmatizer will try to "guess" the part of speech of each word in isolation.
In many cases it goes with a default noun and, if the word is not in a lookup table of common and irregular nouns, it tries generic rules (such as stripping the plural 's').
In other cases it goes with a default verb based on some patterns (the "-ing" at the end), which is probably your case. Since the verb "machine_learning" does not exist in any dictionary (there is no instance of it in its model), it goes the "else" route and applies generic rules.
Therefore, machine_learning is probably being lemmatized by a generic '"ing" to "e"' rule (as in making -> make, baking -> bake), common to many regular verbs.
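That fallback behavior can be sketched in a few lines. This is a simplified stand-in for spaCy's actual lookup tables and English suffix rules, not its real implementation; the rule list and lookup dict here are illustrative assumptions:

```python
# Simplified sketch of a rule-based verb lemmatizer fallback (not spaCy's code).
VERB_RULES = [
    ("ing", "e"),  # generic '"ing" to "e"' rule: making -> make, baking -> bake
    ("ed", ""),    # walked -> walk
    ("s", ""),     # runs -> run
]

def lemmatize_verb(word, lookup=None):
    """Return a table lemma if the form is known (irregulars), otherwise the
    first candidate from a matching suffix rule, otherwise the word itself."""
    lookup = lookup or {"was": "be", "went": "go"}  # illustrative irregulars
    if word in lookup:
        return lookup[word]
    for old, new in VERB_RULES:
        if word.endswith(old):
            return word[: -len(old)] + new  # generic suffix rewrite
    return word

# An unseen '-ing' token goes the "else" route and hits the generic rule:
print(lemmatize_verb("machine_learning"))  # -> machine_learne
print(lemmatize_verb("making"))            # -> make
```

Once a tagger has (wrongly) labeled "machine_learning" as a VERB, a rule set like this produces exactly the "machine_learne" you observed.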
See this test case:
for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])
Output:
[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB',
'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN',
'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ',
'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN',
'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'),
('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN',
'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'),
('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'),
('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN',
'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'),
('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'),
('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'),
('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ',
'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN',
'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'),
('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN',
'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN',
'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN',
'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ',
'machine_learning'), ('NOUN', 'experts'), ('NOUN',
'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'),
('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'),
('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN',
'machine_learning')]
Depending on the context, machine_learning is being used both as a verb and as a noun. But notice that just concatenating the words gets you a mess, because they are not ordered as expected in natural language.
Not even a human could understand and correctly POS-tag this text:
artificially_intelligent funds generating excess returns
artificial_intelligence deep_learning compelling reasons join_us
artificially_intelligent fund develop ai machine_learning capabilities
real cases big players industry discover emerging trends
latest_developments ai machine_learning industry players trading
investing live investment models learn develop compelling business
case clients ceos adopt ai machine_learning investment approaches rare
gathering talents including quants data_scientists researchers ai
machine_learning experts investment_officers explore solutions
challenges potential risks pitfalls adopting ai machine_learning
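So yes, you can expect the same problem with other joined bigrams/trigrams the model has never seen. One pragmatic workaround is to keep any '_'-joined n-gram verbatim and only trust the lemma for ordinary words. The helper below (`safe_lemma`) is a hypothetical sketch, not part of spaCy's API:

```python
def safe_lemma(text, lemma):
    """Hypothetical helper: keep '_'-joined n-grams as-is, since the
    rule-based fallback is unreliable for tokens the model never saw."""
    return text if "_" in text else lemma

# In the lemmatization loop you would call safe_lemma(token.text, token.lemma_):
print(safe_lemma("machine_learning", "machine_learne"))  # -> machine_learning
print(safe_lemma("funds", "fund"))                       # -> fund
```

This keeps the joined phrases stable across documents, which is usually what you want when feeding them to LDA.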