Result difference in Stanford NER tagger: NLTK (Python) vs. Java
I am running the Stanford NER tagger from both Python and Java, but the two give different results.
For example, when I input the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this.":
Java result:
"ERwin": "PERSON"
Python result:
In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out [6]:[(u'Involved', u'O'),
(u'in', u'O'),
(u'all', u'O'),
(u'aspects', u'O'),
(u'of', u'O'),
(u'data', u'O'),
(u'modeling', u'O'),
(u'using', u'O'),
(u'ERwin', u'O'),
(u'as', u'O'),
(u'the', u'O'),
(u'primary', u'O'),
(u'software', u'O'),
(u'for', u'O'),
(u'this.', u'O')]
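The Python NLTK wrapper fails to tag "ERwin" as PERSON.
Interestingly, Python and Java both use the same trained model (english.all.3class.caseless.distsim.crf.ser.gz), released on 2015-04-20.
My ultimate goal is to make Python behave the same way as Java.
I am looking at StanfordNERTagger in nltk.tag to see whether there is anything I can modify. Here is the wrapper code: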
class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified here,
      then this jar file must be specified in the CLASSPATH environment variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
        >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    """

    _SEPARATOR = '/'
    _JAR = 'stanford-ner.jar'
    _FORMAT = 'slashTags'

    def __init__(self, *args, **kwargs):
        super(StanfordNERTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        # Adding -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer
        # -tokenizerOptions tokenizeNLs=false for not using stanford Tokenizer
        return ['edu.stanford.nlp.ie.crf.CRFClassifier',
                '-loadClassifier', self._stanford_model, '-textFile',
                self._input_file_path, '-outputFormat', self._FORMAT,
                '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions', '\"tokenizeNLs=false\"']

    def parse_output(self, text, sentences):
        if self._FORMAT == 'slashTags':
            # Join everything together into one big list
            tagged_sentences = []
            for tagged_sentence in text.strip().split("\n"):
                for tagged_word in tagged_sentence.strip().split():
                    word_tags = tagged_word.strip().split(self._SEPARATOR)
                    tagged_sentences.append((''.join(word_tags[:-1]), word_tags[-1]))

            # Separate it according to the input
            result = []
            start = 0
            for sent in sentences:
                result.append(tagged_sentences[start:start + len(sent)])
                start += len(sent)
            return result

        raise NotImplementedError
Or, if the difference comes from using a different classifier (the Java code appears to use AbstractSequenceClassifier, whereas the Python NLTK wrapper uses CRFClassifier), is there a way to use AbstractSequenceClassifier in the Python wrapper?
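Try setting maxAdditionalKnownLCWords to 0 in the properties file (or on the command line) for CoreNLP and, if possible, for NLTK as well. This disables an option that lets the NER system learn a little from test-time data, which can occasionally cause slightly different results.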
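On the NLTK side, one way this might be done is to append the flag to the argument list that the wrapper's _cmd property builds. The sketch below is only an illustration: build_ner_command is a hypothetical helper (it is not part of NLTK), the input file name is a placeholder, and it assumes CRFClassifier accepts -maxAdditionalKnownLCWords on the command line, as the suggestion above implies. It only constructs the argument list and does not launch Java.

```python
def build_ner_command(model_path, input_path, max_additional_known_lc_words=0):
    """Build a CRFClassifier argument list like the one in the wrapper's _cmd.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    return [
        'edu.stanford.nlp.ie.crf.CRFClassifier',
        '-loadClassifier', model_path,
        '-textFile', input_path,
        '-outputFormat', 'slashTags',
        '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
        '-tokenizerOptions', '"tokenizeNLs=false"',
        # Assumed flag: stop the tagger from adding lowercase words seen at
        # test time to its known-word list.
        '-maxAdditionalKnownLCWords', str(max_additional_known_lc_words),
    ]


if __name__ == '__main__':
    # 'input.txt' is a placeholder path, not a file from the question.
    cmd = build_ner_command(
        'english.all.3class.caseless.distsim.crf.ser.gz', 'input.txt')
    print(' '.join(cmd))
```

In the wrapper itself, the same two elements could presumably be appended in a subclass that overrides _cmd. That variant is untested here, so compare its output against the Java run before relying on it.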