How to cluster named entities from StanfordNER using Python
Stanford NER provides a jar for named entity recognition, but I ran into a problem with one of the sentences I tried to parse. The sentence is:
Joseph E. Seagram & Sons, INC said on Thursday that it is merging its two United States based wine companies
Here is my code:
from nltk import word_tokenize
from nltk.tag import StanfordNERTagger
from nltk.tree import Tree

# stanfordNE2tree is a helper defined elsewhere (not shown here) that turns
# the tagger's (token, tag) pairs into an nltk Tree of entity chunks.
st = StanfordNERTagger('./stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       './stanford-ner/stanford-ner.jar',
                       encoding='utf-8')

ne_in_sent = []
with open("./CCAT/2551newsML.txt") as fd:
    lines = fd.readlines()
    for line in lines:
        print(line)
        tokenized_text = word_tokenize(line)
        classified_text = st.tag(tokenized_text)
        ne_tree = stanfordNE2tree(classified_text)
        for subtree in ne_tree:
            # If subtree is a named-entity chunk, i.e. NE != "O"
            if isinstance(subtree, Tree):
                ne_label = subtree.label()
                ne_string = " ".join([token for token, pos in subtree.leaves()])
                ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)
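When I parse it, I get the following entities tagged as organizations:
(Joseph E. Seagram & Sons, ORGANIZATION) and (Inc, ORGANIZATION)
The file also contains other text, for example: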
TransCo has a very big plane. Transco is moving south.
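Because of the difference in capitalization, the tagger treats these as distinct organizations, so I get
2 entities: (TransCo, ORGANIZATION) and (Transco, ORGANIZATION).
Is it possible to merge them into a single entity?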
Use a cosine similarity check to compare the entity strings and merge the ones that are similar enough.
Reference: Calculate cosine similarity given 2 sentence strings
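
As a rough sketch of that idea, the snippet below groups (string, label) pairs whose strings have a high cosine similarity. It makes a few assumptions that are not in the original post: the vectors are built from lowercased character trigrams rather than whole words (a word-level vector would score "TransCo" against "Transco" as 0), the 0.8 threshold is arbitrary, and the names char_ngrams, cosine and merge_entities are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def merge_entities(entities, threshold=0.8):
    """Greedily group (string, label) pairs with matching labels and similar strings.

    threshold is an assumed cut-off; tune it on your own data.
    """
    clusters = []  # each cluster is a list; its first item is the representative
    for name, label in entities:
        vec = char_ngrams(name)
        for cluster in clusters:
            rep_name, rep_label = cluster[0]
            if label == rep_label and cosine(vec, char_ngrams(rep_name)) >= threshold:
                cluster.append((name, label))
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([(name, label)])
    return clusters


entities = [("TransCo", "ORGANIZATION"), ("Transco", "ORGANIZATION"),
            ("Joseph E. Seagram & Sons", "ORGANIZATION"), ("Inc", "ORGANIZATION")]
for cluster in merge_entities(entities):
    print(cluster)
```

With this input, "TransCo" and "Transco" land in the same cluster because they are identical after lowercasing. This does not rejoin "Joseph E. Seagram & Sons" with "Inc": those two strings share no trigrams, and that split comes from how the tagged tokens were chunked rather than from spelling variation.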