Datasets in the biomedical domain like the word similarity datasets used in word2vec and GloVe

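I am training word2vec on biomedical text. To run word similarity and word analogy tests, I want pairs of biomedical terms that share the same relation (it could be any relation), just as word2vec has a complete list of city-state pairs. I tried searching online, but since I am new to this field, I found it confusing.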

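So, where can I find lists relevant to drug-gene or protein-action relations, etc.? Or how can I mine such data? Please suggest publicly available datasets of this kind. Also, please suggest any other interesting relations I could query.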

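Another way would be to use available ontologies, as they include relations between concepts such as has-part, is-a-way-of-doing, is-a-cause-of, is-a-symptom-of, etc. Can I use ontologies to extract such pairs? If yes, which ontologies, and how?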

Is there a gold-standard dataset that would serve my purpose?

So, where can I find the list relevant to Drug-gene or Protein-action, etc?

Take a look at ChEMBL; for example, aspirin is linked to its target cyclooxygenase.
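Once you have a list of drug-target pairs, you can turn them into analogy questions in the plain-text layout of word2vec's questions-words.txt (a `: section` header followed by lines of four words). Below is a minimal sketch of that step. The function name `build_analogy_questions` is my own, and `drug_x`/`target_x`, `drug_y`/`target_y` are placeholders, not real data; only the aspirin pair comes from the example above.

```python
from itertools import permutations


def build_analogy_questions(section, pairs):
    """Return lines in the word2vec questions-words layout for one relation."""
    lines = [f": {section}"]
    # Every ordered combination of two distinct pairs yields one question "a b c d",
    # read as "a is to b as c is to d".
    for (a, b), (c, d) in permutations(pairs, 2):
        lines.append(f"{a} {b} {c} {d}")
    return lines


# Placeholder pairs: replace drug_x/target_x etc. with pairs exported from your source.
pairs = [
    ("aspirin", "cyclooxygenase"),
    ("drug_x", "target_x"),
    ("drug_y", "target_y"),
]
questions = build_analogy_questions("drug-target", pairs)
print("\n".join(questions))
```

Each term has to be a single token that exists in your vocabulary, so multi-word names need the same normalization you applied to the training text. A file written this way should be usable with gensim's `KeyedVectors.evaluate_word_analogies`, if that is the toolkit you are using.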

Another way would be to use available ontologies as they include relations between concepts such as has-part, is-a-way-of-doing, is-a-cause-of, is-a-symptom-of etc. Can I use ontologies to extract such pairs? If yes, then what ontologies and how?

A good place to start is the ChEBI ontology.
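As for the "how": ontologies such as ChEBI are distributed in OBO format, where each `[Term]` stanza carries `is_a:` and `relationship:` lines that can be read off as (term, relation, term) triples. The following is a rough sketch of such an extraction, not a full OBO parser. The function name `parse_obo_relations` is my own, and the sample stanzas use made-up `EX:` identifiers and names rather than real ChEBI entries.

```python
def parse_obo_relations(obo_text):
    """Collect (subject_name, relation, object_name) triples from OBO text."""
    names = {}       # term id -> term name
    edges = []       # (subject id, relation, object id)
    in_term = False  # True while inside a [Term] stanza
    term_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            term_id = None
            continue
        if not in_term or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        if tag == "id":
            term_id = value
        elif tag == "name":
            names[term_id] = value
        elif tag == "is_a":
            # Drop the trailing "! human-readable name" comment.
            edges.append((term_id, "is_a", value.split("!")[0].strip()))
        elif tag == "relationship":
            relation, target = value.split("!")[0].split()[:2]
            edges.append((term_id, relation, target))
    # Resolve ids to names, skipping edges whose endpoints are unknown.
    return [(names[s], r, names[o]) for s, r, o in edges if s in names and o in names]


# Made-up sample in OBO layout (identifiers and names are illustrative only).
sample = """format-version: 1.2

[Term]
id: EX:1
name: example_drug
is_a: EX:2 ! example_class
relationship: has_role EX:3 ! example_role

[Term]
id: EX:2
name: example_class

[Term]
id: EX:3
name: example_role

[Typedef]
id: has_role
name: has role
"""
triples = parse_obo_relations(sample)
print(triples)
```

Grouping the resulting triples by relation gives you one list of pairs per relation, which is the shape you need for analogy tests.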

nlp

bioinformatics

text-mining

biopython