python3.5 中 StanfordNERTagger 的不同结果 - Stanford-ner-2015-12-09

Different result in StanfordNERTagger in python3.5 - Stanford-ner-2015-12-09

我试了运行例句:

from nltk.tag import StanfordNERTagger
_model_filename = r'D:/standford/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz'

_path_to_jar = r'D:/standford/stanford-ner-2015-12-09/stanford-ner.jar'

st = StanfordNERTagger(model_filename=_model_filename, path_to_jar=_path_to_jar)

st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) 

我的输出如下 python:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

虽然我预计纽约也被选为基于此 reference 的位置。

我尝试了另一个例子如下:

st.tag('Ali is living in London.'.split())

结果如下,正确。

[('Ali', 'PERSON'), ('is', 'O'), ('living', 'O'), ('in', 'O'), ('London.', 'LOCATION')]

你知道为什么它在第一句话中没有将 NY 识别为位置吗?

我正在使用 visual studio 2015,Python 3.5,Stanford-ner-2015-12-09

Stanford NER 工具针对格式正确的新闻文本进行训练,因此标点符号非常重要。来自 docs:

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data.

来自CoNLL 2003 doc

The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST.

通过在例句中添加句号,你应该得到你想要的输出,但仍然没有完美的模型=)

alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/stanford-ner.jar
alvas@ubi:~$ export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/classifiers
alvas@ubi:~$ python3
Python 3.5.2 (default, Jul  5 2016, 12:43:10) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> sent = 'Rami Eid is studying at Stony Brook University in NY .'.split()
>>> st.tag(sent)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'), ('.', 'O')]
>>> sent = 'Rami Eid is studying at Stony Brook University in NY'.split()
>>> st.tag(sent)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]