Anaphora resolution in stanford-nlp using python
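I am trying to do anaphora resolution, and below is my code.
First I navigate to the folder where I downloaded the Stanford module. Then I run the following command in the command prompt to start the Stanford CoreNLP server: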
java -mx4g -cp "*;stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
After that I execute the code below in Python:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
I want to change the sentence "Tom is a smart boy. He know a lot of thing." into "Tom is a smart boy. Tom know a lot of thing.", but I could not find a tutorial or any help for doing this in Python.
All I am able to do is annotate the text with the code below in Python, using coreference resolution:
output = nlp.annotate(sentence, properties={'annotators':'dcoref','outputFormat':'json','ner.useSUTime':'false'})
and then pull out the coreferences:
coreferences = output['corefs']
which gives me the JSON below:
coreferences
{u'1': [{u'animacy': u'ANIMATE',
u'endIndex': 2,
u'gender': u'MALE',
u'headIndex': 1,
u'id': 1,
u'isRepresentativeMention': True,
u'number': u'SINGULAR',
u'position': [1, 1],
u'sentNum': 1,
u'startIndex': 1,
u'text': u'Tom',
u'type': u'PROPER'},
{u'animacy': u'ANIMATE',
u'endIndex': 6,
u'gender': u'MALE',
u'headIndex': 5,
u'id': 2,
u'isRepresentativeMention': False,
u'number': u'SINGULAR',
u'position': [1, 2],
u'sentNum': 1,
u'startIndex': 3,
u'text': u'a smart boy',
u'type': u'NOMINAL'},
{u'animacy': u'ANIMATE',
u'endIndex': 2,
u'gender': u'MALE',
u'headIndex': 1,
u'id': 3,
u'isRepresentativeMention': False,
u'number': u'SINGULAR',
u'position': [2, 1],
u'sentNum': 2,
u'startIndex': 1,
u'text': u'He',
u'type': u'PRONOMINAL'}],
u'4': [{u'animacy': u'INANIMATE',
u'endIndex': 7,
u'gender': u'NEUTRAL',
u'headIndex': 4,
u'id': 4,
u'isRepresentativeMention': True,
u'number': u'SINGULAR',
u'position': [2, 2],
u'sentNum': 2,
u'startIndex': 3,
u'text': u'a lot of thing',
u'type': u'NOMINAL'}]}
Any help on this?
I had a similar problem. After trying CoreNLP, I solved it with NeuralCoref. You can do the job easily with NeuralCoref using the code below:
import spacy
nlp = spacy.load('en_coref_md')
doc = nlp(u'Phone area code will be valid only when all the below conditions are met. It cannot be left blank. It should be numeric. It cannot be less than 200. Minimum number of digits should be 3. ')
print(doc._.coref_clusters)
print(doc._.coref_resolved)
The output of the above code is:
[Phone area code: [Phone area code, It, It, It]]
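Phone area code will be valid only when all the below conditions are met. Phone area code cannot be left blank. Phone area code should be numeric. Phone area code cannot be less than 200. Minimum number of digits should be 3.
For this you need spacy, along with an English model, which can be en_coref_md, en_coref_lg or en_coref_sm. You can refer to the following link for a better explanation: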
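Here is one possible solution, which uses the data structure output by CoreNLP. All the information is provided. It is not a complete solution and will probably need extending to handle all cases, but it is a good starting point.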
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')


def resolve(corenlp_output):
    """ Transfer the word form of the antecedent to its associated pronominal anaphor(s) """
    for coref in corenlp_output['corefs']:
        mentions = corenlp_output['corefs'][coref]
        antecedent = mentions[0]  # the antecedent is the first mention in the coreference chain
        for j in range(1, len(mentions)):
            mention = mentions[j]
            if mention['type'] == 'PRONOMINAL':
                # get the attributes of the target mention in the corresponding sentence
                target_sentence = mention['sentNum']
                target_token = mention['startIndex'] - 1
                # transfer the antecedent's word form to the appropriate token in the sentence
                corenlp_output['sentences'][target_sentence - 1]['tokens'][target_token]['word'] = antecedent['text']


def print_resolved(corenlp_output):
    """ Print the "resolved" output """
    possessives = ['hers', 'his', 'their', 'theirs']
    for sentence in corenlp_output['sentences']:
        for token in sentence['tokens']:
            output_word = token['word']
            # check lemmas as well as tags for possessive pronouns in case of tagging errors
            if token['lemma'] in possessives or token['pos'] == 'PRP$':
                output_word += "'s"  # add the possessive morpheme
            output_word += token['after']
            print(output_word, end='')


text = "Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but " \
       "hers is blue. It is older than hers. The big cat ate its dinner."
output = nlp.annotate(text, properties={'annotators': 'dcoref', 'outputFormat': 'json', 'ner.useSUTime': 'false'})
resolve(output)
print('Original:', text)
print('Resolved: ', end='')
print_resolved(output)
This gives the following output:
Original: Tom and Jane are good friends. They are cool. He knows a lot of things and so does she. His car is red, but hers is blue. It is older than hers. The big cat ate its dinner.
Resolved: Tom and Jane are good friends. Tom and Jane are cool. Tom knows a lot of things and so does Jane. Tom's car is red, but Jane's is blue. His car is older than Jane's. The big cat ate The big cat's dinner.
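As you can see, this solution does not correct the case when a pronoun has a sentence-initial (title-case) antecedent ("The big cat" instead of "the big cat" in the last sentence). The right fix depends on the category of the antecedent: common noun antecedents need lowercasing, while proper noun antecedents do not.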
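The case problem described above could be handled with a small helper along these lines. This is only a sketch: `adjust_case` is a hypothetical name (not part of pycorenlp), and it assumes the mention dictionaries carry the `type`, `text` and `startIndex` keys shown in the CoreNLP JSON earlier in this thread. It lowercases the first letter only for `NOMINAL` antecedents, and only when the pronoun being replaced is not itself the first token of its sentence.

```python
def adjust_case(antecedent, mention):
    """Return the antecedent's text, lowercasing its first letter when a
    common-noun antecedent is substituted in the middle of a sentence.

    Both arguments are mention dicts in the CoreNLP 'corefs' format.
    Hypothetical helper for illustration only.
    """
    text = antecedent['text']
    # CoreNLP token indices are 1-based, so 1 means sentence-initial.
    sentence_initial = mention['startIndex'] == 1
    if antecedent['type'] == 'NOMINAL' and not sentence_initial:
        return text[:1].lower() + text[1:]
    # Proper nouns, lists and sentence-initial targets are left unchanged.
    return text


# Hand-made mentions mirroring "The big cat ate its dinner."
cat = {'text': 'The big cat', 'type': 'NOMINAL'}
its = {'text': 'its', 'type': 'PRONOMINAL', 'startIndex': 5}
print(adjust_case(cat, its))  # the big cat
```

In `resolve`, the assignment could then use `adjust_case(antecedent, mention)` in place of `antecedent['text']`.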
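Some other ad hoc processing may be necessary (as for the possessives in my test sentence). The code also assumes that you will not want to reuse the original output tokens, since it modifies them in place. A way around this is to make a copy of the original data structure, or to create a new attribute and change the print_resolved function accordingly.
Correcting any resolution errors is yet another challenge!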
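The copying approach mentioned above can be sketched as follows. `resolve_copy` is a hypothetical wrapper, `resolve` is a condensed version of the function from this answer, and the small dictionary is hand-made sample data in the shape of CoreNLP's JSON, not real server output.

```python
import copy


def resolve(corenlp_output):
    """Overwrite each pronominal mention's token with its antecedent's text (in place)."""
    for mentions in corenlp_output['corefs'].values():
        antecedent = mentions[0]
        for mention in mentions[1:]:
            if mention['type'] == 'PRONOMINAL':
                sentence = corenlp_output['sentences'][mention['sentNum'] - 1]
                sentence['tokens'][mention['startIndex'] - 1]['word'] = antecedent['text']


def resolve_copy(corenlp_output):
    """Hypothetical wrapper: resolve a deep copy and leave the original untouched."""
    resolved = copy.deepcopy(corenlp_output)
    resolve(resolved)
    return resolved


# Hand-made sample data (assumption), trimmed to the keys the code reads.
sample = {
    'corefs': {'1': [
        {'type': 'PROPER', 'text': 'Tom', 'sentNum': 1, 'startIndex': 1},
        {'type': 'PRONOMINAL', 'text': 'He', 'sentNum': 2, 'startIndex': 1},
    ]},
    'sentences': [
        {'tokens': [{'word': 'Tom'}, {'word': 'sleeps'}, {'word': '.'}]},
        {'tokens': [{'word': 'He'}, {'word': 'snores'}, {'word': '.'}]},
    ],
}

resolved = resolve_copy(sample)
print(sample['sentences'][1]['tokens'][0]['word'])    # He
print(resolved['sentences'][1]['tokens'][0]['word'])  # Tom
```

A deep copy is used because the tokens are nested dictionaries; a shallow copy would still share them with the original.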
from stanfordnlp.server import CoreNLPClient
from nltk import tokenize

client = CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'coref'], memory='4G', endpoint='http://localhost:9001')


def pronoun_resolution(text):
    ann = client.annotate(text)
    modified_text = tokenize.sent_tokenize(text)
    for coref in ann.corefChain:
        antecedent = []
        for mention in coref.mention:
            phrase = []
            for i in range(mention.beginIndex, mention.endIndex):
                phrase.append(ann.sentence[mention.sentenceIndex].token[i].word)
            if antecedent == []:
                antecedent = ' '.join(word for word in phrase)
            else:
                anaphor = ' '.join(word for word in phrase)
                modified_text[mention.sentenceIndex] = modified_text[mention.sentenceIndex].replace(anaphor, antecedent)
    modified_text = ' '.join(modified_text)
    return modified_text


text = 'Tom is a smart boy. He knows a lot of things.'
pronoun_resolution(text)
Output: 'Tom is a smart boy. Tom knows a lot of things.'
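One caveat with the snippet above: `str.replace` substitutes every occurrence of the anaphor's characters in the sentence, including ones inside longer words. A possible refinement, sketched here with the standard `re` module, is to match the anaphor only as a whole word and only once. `replace_whole_word` is a hypothetical helper, not part of stanfordnlp or NLTK.

```python
import re


def replace_whole_word(sentence, anaphor, antecedent):
    """Replace the first whole-word occurrence of `anaphor` in `sentence`."""
    pattern = r'\b' + re.escape(anaphor) + r'\b'
    # A function as the replacement keeps backslashes in `antecedent` literal.
    return re.sub(pattern, lambda match: antecedent, sentence, count=1)


sentence = 'Then he said the car is red.'
print(sentence.replace('he', 'Tom'))
# TTomn Tom said tTom car is red.
print(replace_whole_word(sentence, 'he', 'Tom'))
# Then Tom said the car is red.
```

This still picks the first whole-word match rather than the exact mention, so rebuilding the sentence from token indices (`beginIndex` and `endIndex`) would probably be more robust when the same pronoun appears several times.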