Poincare embeddings: building transitive closures from WordNet
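I want to replicate Figure 2 of Poincaré Embeddings for Learning Hierarchical Representations, i.e. the Poincare embeddings of the "mammal" subtree of WordNet.
First, I construct the transitive closure needed to represent the graph. Following these docs and this SO answer, I build the relations as follows: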
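from nltk.corpus import wordnet as wn
root = wn.synset('mammal.n.01')
# closure() expects a callable that returns a synset's hyponyms
words = list(set([w for s in root.closure(lambda s: s.hyponyms()) for w in s.lemma_names()]))
rname = root.name().split('.')[0]
# pair every word in the subtree directly with the root
closure = [(word, rname) for word in words]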
Then I use Gensim's Poincare model to compute the embeddings. Given the example relations in the Gensim docs, such as
relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
I inferred that the hypernym needs to be on the right. Here is the model-fitting code:
from gensim.models.poincare import PoincareModel
from gensim.viz.poincare import poincare_2d_visualization
model = PoincareModel(relations, size=2, negative=0)
model.train(epochs=50)
fig = poincare_2d_visualization(model, relations, 'WordNet Poincare embeddings')
fig.show()
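However, the result is clearly incorrect, as it looks nothing like the figure in the paper. What am I doing wrong?
I think the main problem here stems from this line: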
closure = [(word, rname) for word in words]
You are generating a list in which every word is related only to rname, i.e. "mammal". That is, you only get ("columbian_mammoth", "mammal") and miss the intermediate steps ("columbian_mammoth", "mammoth"), ("mammoth", "elephant"), ("elephant", "proboscidean") and so on.
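To make the difference concrete, here is a minimal sketch on a toy hierarchy. The `PARENT` map and the helper `distinct_hypernyms` are hypothetical and only stand in for WordNet; the real hierarchy has more levels between "proboscidean" and "mammal".

```python
# Toy hypernym map (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "columbian_mammoth": "mammoth",
    "mammoth": "elephant",
    "elephant": "proboscidean",
    "proboscidean": "mammal",  # simplification: WordNet has more levels here
}
ROOT = "mammal"

# What the question builds: every word paired directly with the root.
star_pairs = sorted((word, ROOT) for word in PARENT)

# Direct (child, parent) edges, which keep the intermediate levels.
edge_pairs = sorted(PARENT.items())


def distinct_hypernyms(pairs):
    """Count the distinct words that appear on the right-hand (hypernym) side."""
    return len({parent for _, parent in pairs})


print(distinct_hypernyms(star_pairs))  # 1
print(distinct_hypernyms(edge_pairs))  # 4
```

With only one distinct hypernym, the training data describes a flat star around "mammal", so the model has no information from which to recover the nested structure.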
I suggest using a recursive function, append_pairs, to fix this. I also slightly tweaked the parameters of PoincareModel and poincare_2d_visualization.
from nltk.corpus import wordnet as wn
from gensim.models.poincare import PoincareModel
from gensim.viz.poincare import poincare_2d_visualization

def simple_name(r):
    # 'mammal.n.01' -> 'mammal'
    return r.name().split('.')[0]

def append_pairs(my_root, pairs):
    # collect (hyponym, hypernym) pairs for every direct edge below my_root
    for w in my_root.hyponyms():
        pairs.append((simple_name(w), simple_name(my_root)))
        append_pairs(w, pairs)
    return pairs

if __name__ == '__main__':
    root = wn.synset('mammal.n.01')
    # kept from the question; not used below
    words = list(set([w for s in root.closure(lambda s: s.hyponyms()) for w in s.lemma_names()]))
    relations = append_pairs(root, [])
    model = PoincareModel(relations, size=2, negative=10)
    model.train(epochs=20)
    fig = poincare_2d_visualization(model, relations, 'WordNet Poincare embeddings', num_nodes=None)
    fig.show()
The image is not yet as pretty as the one in the original source, but at least you can now see the clustering.
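The direct edges collected above are not yet the transitive closure mentioned in the question, which also contains every (descendant, ancestor) pair. A minimal sketch of how the closure could be derived from direct edges follows. The function name `transitive_closure` and the toy edges are assumptions for illustration, and whether training on the closure brings the plot closer to the paper's figure has not been tested here.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return all (descendant, ancestor) pairs implied by (child, parent) edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = set()
    for node in list(parents):
        # Walk upward from each node, visiting every ancestor exactly once.
        stack = list(parents[node])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((node, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return sorted(closure)


# Toy edges (hypothetical); the repeated pair shows that duplicates collapse.
toy_edges = [
    ("mammoth", "elephant"),
    ("elephant", "proboscidean"),
    ("proboscidean", "mammal"),
    ("mammoth", "elephant"),
]
toy_closure = transitive_closure(toy_edges)
print(len(toy_closure))  # 6
```

The same function could be applied to the output of append_pairs before passing the result to PoincareModel, which would also drop the duplicate pairs that the recursion produces when a synset is reachable through more than one hypernym.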