How to classify derived words that share meaning as the same tokens?

Question

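I'd like to count the unrelated words in an article, but I have trouble grouping words of the same meaning that are derived from one another.

For instance, I'd like gasoline and gas to be treated as the same token in sentences like "The price of gasoline has risen." and ""Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol"." Therefore, if these two sentences made up the entire article, the count for gas (or gasoline) would be 3 (petrol would not be counted).

I have tried NLTK's stemmers and lemmatizers, but to no avail. Most of them reduce gas to gas and gasoline to gasolin, which does not help for my purposes at all. I understand that this is the usual behaviour. I looked at a question that seems somewhat similar, but the answers there are not completely applicable to my case, because I require the words to be derived from one another.

How can I treat derived words of the same meaning as the same token in order to count them together?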

Answer 1

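I propose a two-step approach:

First, find synonyms by comparing word embeddings (non-stopwords only). This should filter out words that are written similarly but mean something else, such as gasoline and gaseous.

Then, check whether the synonyms share part of their stem. Essentially, test whether "gas" is contained in "gasolin" or vice versa. This is sufficient because you only compare words that have already passed the synonym check.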

import itertools

import spacy
from nltk.stem.porter import PorterStemmer

# similarity above which two tokens are treated as candidate synonyms
threshold = 0.6

# compare the stems of two candidate synonyms:
# True if one stem is contained in the other
stemmer = PorterStemmer()
def compare_stems(a, b):
  stem_a, stem_b = stemmer.stem(a), stemmer.stem(b)
  return stem_a in stem_b or stem_b in stem_a

candidate_synonyms = {}
# add a pair to the candidate dictionary of sets
def add_to_synonym_dict(a, b):
  if a not in candidate_synonyms:
    if b not in candidate_synonyms:
      # neither word is a key yet: start a new set under a
      candidate_synonyms[a] = {a, b}
      return
    # only b is already a key: swap so that a names the existing set
    a, b = b, a
  candidate_synonyms[a].add(b)

# large English model with word vectors (needed for similarity)
nlp = spacy.load('en_core_web_lg')

text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'

words = nlp(text)

# compare every token with every other token
for a, b in itertools.combinations(words, 2):
  # skip the pair if either token is a stopword or punctuation
  if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
    continue
  # keep pairs that are similar in meaning and share a stem
  if a.similarity(b) > threshold:
    if compare_stems(a.text.lower(), b.text.lower()):
      add_to_synonym_dict(a.text.lower(), b.text.lower())

print(candidate_synonyms)
# output: {'gasoline': {'gas', 'gasoline'}}

You can then count your candidate synonyms based on their occurrences in the text.
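The counting step could look roughly like the sketch below. It uses only the standard library and starts from the `candidate_synonyms` dictionary printed above, copied in by hand. The regex tokenization and the `canonical` lookup table are assumptions made for illustration, not part of the code above; in practice you would probably reuse the spaCy tokens.

```python
import re
from collections import Counter

# Result of the grouping step (copied from its printed output).
candidate_synonyms = {'gasoline': {'gas', 'gasoline'}}

text = ('The price of gasoline has risen. "Gas" is a colloquial form of the '
        'word gasoline in North American English. Conversely in BE the term '
        'would be petrol. A gaseous state has nothing to do with oil.')

# Map every group member to its dictionary key, e.g. 'gas' -> 'gasoline'.
canonical = {member: key
             for key, members in candidate_synonyms.items()
             for member in members}

# Crude regex tokenization (an assumption; spaCy tokens would work as well).
tokens = re.findall(r"[a-z]+", text.lower())

# Count tokens, replacing each group member by its key first.
counts = Counter(canonical.get(tok, tok) for tok in tokens)

print(counts['gasoline'])  # 3  (gasoline, gas, gasoline)
print(counts['gaseous'])   # 1  (not merged into the gasoline group)
```

Words that belong to no group are counted under their own lowercase form, so petrol and gaseous keep separate counts.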

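Note: I picked the synonym threshold of 0.6 more or less arbitrarily. You will probably want to test which threshold suits your task. Also, my code is just a quick and dirty example; it could be written a lot more cleanly.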
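One place where the quick example could be tightened is the grouping itself. `add_to_synonym_dict` only looks at dictionary keys, so a pair such as (gas, gases) arriving after (gasoline, gas) would start a second group instead of extending the first. A possible alternative, sketched here with a small union-find, merges pairs transitively. The function name `group_pairs` and the sample pairs are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def group_pairs(pairs):
    """Merge word pairs into disjoint groups using a small union-find."""
    parent = {}

    def find(word):
        # Register unseen words as their own root, then walk up to the root.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # join the two groups

    # Collect all words under their final root.
    groups = defaultdict(set)
    for word in list(parent):
        groups[find(word)].add(word)
    # Drop singleton groups, which can arise from pairs like (x, x).
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical pairs that passed the similarity and stem checks.
pairs = [('gasoline', 'gas'), ('gas', 'gases'), ('petrol', 'petrols')]
result = group_pairs(pairs)
print(result)
```

With these sample pairs, gasoline, gas and gases end up in one group and petrol and petrols in another, regardless of the order in which the pairs arrive.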

python

nlp

text-mining

nltk