How can I iterate token attributes with coreference results in CoreNLP?

Question

I am looking for a way to extract and merge annotation results from CoreNLP. To be specific:

import stanza
import os
from stanza.server import CoreNLPClient
corenlp_dir = '/Users/fatih/stanford-corenlp-4.2.0/'
os.environ['CORENLP_HOME'] = corenlp_dir

client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'coref'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)

text = "Barack Obama was born in Hawaii.  He is the president. Obama was elected in 2008."

doc = client.annotate(text)

for x in doc.corefChain:
    for y in x.mention:
        print(y.animacy)
        
ANIMATE
ANIMATE
ANIMATE

I would like to merge these results with the output of the following code:

for i, sent in enumerate(doc.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

[Sentence 1]
Barack          Barack          NNP     PERSON
Obama           Obama           NNP     PERSON
was             be              VBD     O
born            bear            VBN     O
in              in              IN      O
Hawaii          Hawaii          NNP     STATE_OR_PROVINCE
.               .               .       O

[Sentence 2]
He              he              PRP     O
is              be              VBZ     O
the             the             DT      O
president       president       NN      TITLE
.               .               .       O

[Sentence 3]
Obama           Obama           NNP     PERSON
was             be              VBD     O
elected         elect           VBN     O
in              in              IN      O
2008            2008            CD      DATE
.               .               .       O

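Because the annotations are stored in different objects, I cannot iterate over the two objects together and get the results for the corresponding items.

Is there a way to do this?

Thanks.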

Answer 1

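Each mention in a coref chain has a sentenceIndex and a beginIndex, which should correspond to a sentence in the document and a token position within that sentence. You can use them to relate the two.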

https://github.com/stanfordnlp/stanza/blob/f0338f891a03e242c7e11e440dec6e191d54ab77/doc/CoreNLP.proto#L319
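
As a minimal sketch of that lookup: sentenceIndex picks the sentence, and beginIndex/endIndex pick the tokens. This assumes the indices are 0-based and that endIndex is exclusive. The helper name mention_tokens is hypothetical, and the SimpleNamespace objects only stand in for the protobuf messages returned by client.annotate, so the snippet runs without a CoreNLP server.

```python
from types import SimpleNamespace as NS


def mention_tokens(doc, mention):
    """Return the tokens covered by one coref mention (hypothetical helper).

    Assumes beginIndex/endIndex form a 0-based, half-open token range.
    """
    sentence = doc.sentence[mention.sentenceIndex]
    return list(sentence.token[mention.beginIndex:mention.endIndex])


# Stand-in for an annotated document; field names follow CoreNLP.proto.
doc = NS(
    sentence=[
        NS(token=[NS(word=w) for w in
                  ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]]),
        NS(token=[NS(word=w) for w in ["He", "is", "the", "president", "."]]),
    ],
    corefChain=[
        NS(chainID=1, mention=[
            NS(sentenceIndex=0, beginIndex=0, endIndex=2, animacy="ANIMATE"),
            NS(sentenceIndex=1, beginIndex=0, endIndex=1, animacy="ANIMATE"),
        ]),
    ],
)

# Surface text of every mention, in chain order.
spans = [
    " ".join(t.word for t in mention_tokens(doc, m))
    for chain in doc.corefChain
    for m in chain.mention
]
print(spans)
```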

Edit: a quick and dirty change to your example code:

from collections import defaultdict
from stanza.server import CoreNLPClient

client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'coref'],
    be_quiet=False)

text = "Barack Obama was born in Hawaii.  In 2008 he became the president."

doc = client.annotate(text)

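# animacy[sentence index][token index] is set for every token inside a mention.
# Note: the stored value is a True flag, not the mention's animacy label.
animacy = defaultdict(dict)
for x in doc.corefChain:
    for y in x.mention:
        print(y.animacy)
        # Indices are relative to the mention's sentence
        for i in range(y.beginIndex, y.endIndex):
            animacy[y.sentenceIndex][i] = True
            print(y.sentenceIndex, i)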

for sent_idx, sent in enumerate(doc.sentence):
    print("[Sentence {}]".format(sent_idx+1))
    for t_idx, token in enumerate(sent.token):
        animate = animacy[sent_idx].get(t_idx, False)
        print("{:12s}\t{:12s}\t{:6s}\t{:20s}\t{}".format(token.word, token.lemma, token.pos, token.ner, animate))
    print("")
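
The code above only records a True flag for tokens that fall inside some mention. If the animacy label itself is wanted next to each token, one possible variation is to store the label (and the chain ID) instead. The following is a sketch under assumptions, not a tested CoreNLP recipe: token_coref_info and merged_rows are hypothetical helper names, the SimpleNamespace objects stand in for the real annotated document, and when mentions overlap the last one visited simply wins.

```python
from types import SimpleNamespace as NS


def token_coref_info(doc):
    """Map (sentence index, token index) -> (chainID, animacy).

    Hypothetical helper; overlapping mentions overwrite earlier entries.
    """
    info = {}
    for chain in doc.corefChain:
        for m in chain.mention:
            for tok_idx in range(m.beginIndex, m.endIndex):
                info[(m.sentenceIndex, tok_idx)] = (chain.chainID, m.animacy)
    return info


def merged_rows(doc):
    """Return one row per token: token attributes plus coref attributes."""
    info = token_coref_info(doc)
    rows = []
    for s_idx, sent in enumerate(doc.sentence):
        for t_idx, tok in enumerate(sent.token):
            # Tokens outside every mention get (None, None).
            chain_id, animacy = info.get((s_idx, t_idx), (None, None))
            rows.append((s_idx, tok.word, tok.pos, tok.ner, chain_id, animacy))
    return rows


def _tok(word, pos, ner):
    return NS(word=word, pos=pos, ner=ner)


# Stand-in for client.annotate(text); field names follow CoreNLP.proto.
doc = NS(
    sentence=[
        NS(token=[_tok("Barack", "NNP", "PERSON"), _tok("Obama", "NNP", "PERSON"),
                  _tok("was", "VBD", "O"), _tok("born", "VBN", "O")]),
        NS(token=[_tok("He", "PRP", "O"), _tok("is", "VBZ", "O")]),
    ],
    corefChain=[
        NS(chainID=1, mention=[
            NS(sentenceIndex=0, beginIndex=0, endIndex=2, animacy="ANIMATE"),
            NS(sentenceIndex=1, beginIndex=0, endIndex=1, animacy="ANIMATE"),
        ]),
    ],
)

rows = merged_rows(doc)
for row in rows:
    print(row)
```

With a real server, the same two functions should accept the object returned by client.annotate(text) unchanged, since they only read the sentence, token, corefChain, and mention fields.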


nlp

stanford-nlp