How to perform ngram to ngram association

Question

Can anyone point me in the right direction to solve the following problem?

I have a long list of medical terms from UMLS; a sample might be:

Disease control is good
Disease control is poor
Disease control is excellent
Drug adherence
Current drug
Sodium Valproate
Antibiotic VI
Epilepsy control is good
Frequent seizures
Clinically isolated syndrome
Fractured patella
Fractured femur

I also have a second list of phrases that do not match these strings exactly but are similar, e.g.

Good control of epilepsy    -->      Epilepsy control is good
Broken tibia                -->      Fractured tibia
Currently prescribed drugs  -->      Current drugs

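I basically want to find, for each phrase in my second list, the best match in my first list.

I know about ngram collocations, but that seems to be about finding the top-ranked ngrams within a single text corpus rather than associating one ngram with another.

Do I need to look at string matching algorithms, or at more machine-learning-based approaches?

Does anyone know of any software packages that can do this? I looked at Python NLTK but couldn't find this kind of functionality.

Thanks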

Answer 1

Personally, I would first look at Levenshtein distance as a basic and simple approach that may well work. I would stem the words first and then run Levenshtein.
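A minimal sketch of the stem-then-Levenshtein idea follows. It makes several assumptions that are not in the answer: `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer (such as NLTK's `PorterStemmer`), tokens are sorted so that word order matters less, and `best_match` is an illustrative helper name.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase: str) -> str:
    """Lowercase, stem, and sort tokens (sorting is an added assumption)."""
    return " ".join(sorted(crude_stem(w) for w in phrase.lower().split()))


def best_match(query: str, terms: list[str]) -> tuple[str, int]:
    """Return the term with the smallest edit distance to the query."""
    q = normalize(query)
    return min(((t, levenshtein(q, normalize(t))) for t in terms),
               key=lambda pair: pair[1])


terms = ["Drug adherence", "Current drug", "Fractured femur", "Frequent seizures"]
print(best_match("Current drugs", terms))
```

Pure edit distance cannot relate synonyms such as "Broken" and "Fractured", so this kind of matching is best treated as a baseline.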

A more sophisticated approach would be to use an already trained word2vec model (available in Spark and NLTK), and then aggregate the vectors of the words that appear in each ngram to generate vectors for the ngrams. Finally, you can compare the resulting vectors and find the most similar pairs. There are libraries out there that allow you to generate these aggregated vector representations for ngrams and documents. You can also look up the relevant papers and come up with, and implement, your own aggregation method tailored to your specific needs.
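The vector-aggregation idea can be sketched as below, using plain averaging and cosine similarity. This is only an illustration under stated assumptions: `TOY_EMBEDDINGS` holds made-up 3-dimensional vectors in place of a real pretrained word2vec model, averaging is just one possible aggregation method, and the function names are hypothetical.

```python
import math

# Made-up vectors for illustration only; a real setup would load a
# pretrained word2vec model instead.
TOY_EMBEDDINGS = {
    "broken":    [0.90, 0.10, 0.00],
    "fractured": [0.85, 0.15, 0.00],
    "tibia":     [0.10, 0.90, 0.00],
    "epilepsy":  [0.00, 0.10, 0.90],
    "control":   [0.10, 0.00, 0.80],
}


def phrase_vector(phrase, embeddings):
    """Average the vectors of known words; None if no word is known."""
    vectors = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates, embeddings):
    """Return the candidate whose averaged vector is closest to the query's."""
    q = phrase_vector(query, embeddings)
    if q is None:
        return None
    scored = []
    for candidate in candidates:
        c = phrase_vector(candidate, embeddings)
        if c is not None:
            scored.append((cosine(q, c), candidate))
    return max(scored)[1] if scored else None


print(most_similar("Broken tibia",
                   ["Epilepsy control", "Fractured tibia"],
                   TOY_EMBEDDINGS))
```

Unlike edit distance, this can pair "Broken" with "Fractured", but only as far as the underlying word vectors place them close together.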

python

nlp

machine-learning

associations

n-gram