Is it possible to improve spaCy's similarity results with custom named entities?
I've found spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out of the box.
I'd like to strengthen the relationships in certain areas and thought that adding custom NER labels to the model would help, but my results before and after show no improvement, even though I've been able to create a test set of custom entities.
Now I'm wondering: is my theory completely wrong, or am I just missing something?
If I'm wrong, what's the best way to improve the results? It seems like some kind of custom labeling should help.
Here's the example I've tested so far:
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load("en_core_web_lg")
docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")
sim_before = docA.similarity(docB)
print(sim_before)
0.5949629181460099
^^ Not bad, but I'd like to see something closer to 0.85 for this example.
So I used EntityRuler and added some patterns to try to strengthen the relationships:
ruler = EntityRuler(nlp)
patterns = [
{"label": "ADDITION", "pattern": "Add"},
{"label": "ADDITION", "pattern": "plus"},
{"label": "FRACTION", "pattern": "one-third"},
{"label": "FRACTION", "pattern": "fractions"},
{"label": "FRACTION", "pattern": "denominators"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)
['tagger', 'parser', 'entity_ruler', 'ner']
Adding GoldParse seemed important, so I added the following and updated the NER:
doc1 = Doc(nlp.vocab, [u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
gold1 = GoldParse(doc1, [u'O', u'O', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])
doc2 = Doc(nlp.vocab, [u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, [u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])
ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
{'ner': 0.0}
You can see that my custom entities are working, but the test results show zero improvement:
test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")
print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])
sim = test1.similarity(test2)
print(sim)
[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099
Any tips would be greatly appreciated!
Doc.similarity only uses word vectors, not any other annotations. From the Doc API:
The default estimate is cosine similarity using an average of word vectors.
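To illustrate why custom entity labels can't move the score, here's a minimal sketch of the computation Doc.similarity performs, using pure NumPy with toy 3-d vectors standing in for spaCy's 300-d ones. The averaging-then-cosine step is all there is; entity annotations never enter into it:

```python
import numpy as np

def doc_similarity(vectors_a, vectors_b):
    # Mimic Doc.similarity: cosine similarity of the averaged word vectors.
    # Entity labels play no part in this computation.
    avg_a = np.mean(vectors_a, axis=0)
    avg_b = np.mean(vectors_b, axis=0)
    return float(np.dot(avg_a, avg_b) /
                 (np.linalg.norm(avg_a) * np.linalg.norm(avg_b)))

# Toy 3-d "word vectors" standing in for spaCy's 300-d GloVe vectors.
doc_a = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
doc_b = [np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])]
print(doc_similarity(doc_a, doc_b))
```

Retraining the NER changes which spans land in doc.ents, but the token vectors (and therefore their average) stay exactly the same, which is why the score is identical before and after.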
I found my solution in this tutorial: Text Classification in Python Using spaCy, which generates a BoW matrix for spaCy's text data using scikit-learn's CountVectorizer.
I avoided the sentiment-analysis tutorials because of their binary classification, since I needed support for multiple classes. The trick was to set multi_class='auto' on the LogisticRegression linear model, and to use average='micro' on the precision and recall scores, so that all of my text data, such as the entities, was utilized:
classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
and...
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
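For reference, here's a minimal, self-contained sketch of that approach on a toy three-class corpus. The texts and labels below are invented for illustration, and recent scikit-learn versions select multinomial handling automatically (the multi_class argument has since been deprecated), so it is omitted here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Tiny three-class toy corpus standing in for the tutorial's text data.
texts = [
    "Add fractions with like denominators",
    "What does one-third plus one-third equal",
    "Subtract fractions with unlike denominators",
    "What is four minus two",
    "Multiply the numerators together",
    "What is three times three",
]
labels = ["addition", "addition", "subtraction", "subtraction",
          "multiplication", "multiplication"]

# Bag-of-words features, as in the tutorial's CountVectorizer setup.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

classifier = LogisticRegression(solver="lbfgs")
classifier.fit(X, labels)
predicted = classifier.predict(X)

# average='micro' aggregates counts over all classes, so every labeled
# example contributes to the score.
print("Accuracy:", metrics.accuracy_score(labels, predicted))
print("Precision:", metrics.precision_score(labels, predicted, average="micro"))
print("Recall:", metrics.recall_score(labels, predicted, average="micro"))
```

Note that for single-label multiclass problems, micro-averaged precision and recall are mathematically equal to accuracy; they become more informative once you evaluate on a held-out test split rather than the training data as in this sketch.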
Hope this helps save you some time!