NER: combining BIO tokens to form the original compound word
Is there any approach to combining BIO tokens back into compound words?
I implemented the method below to form words from the BIO scheme, but it does not work well for words that contain punctuation. For example, the function below joins S.E.C as "S . E . C":
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            # Flush any open entity before handling the non-matching tag
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Sort and de-duplicate the collected entities
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
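For illustration, here is a minimal reproduction of the problem; the tagger output below is a made-up example in (token, tag) form:

ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"), (".", "I-ORG"), ("C", "I-ORG")]
print(collapse(ner_result))
# [['S . E . C', 'ORG']]  <- " ".join puts a space before every token, including punctuation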
Another approach:
I also tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:
Orig sentence -> parties. \n \n IN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto
Another example:
Orig sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be
Note that the newlines are dropped and the punctuation is detached from its word after the TreebankWordDetokenizer round trip.
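A minimal sketch of that round trip, assuming NLTK's word_tokenize is used for the tokenization step:

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

orig = "parties.\n\nIN WITNESS WHEREOF, the parties hereto"
tokens = word_tokenize(orig)  # the newlines are already lost at this step
print(TreebankWordDetokenizer().detokenize(tokens))
# the newlines are gone, and the spacing no longer matches the original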
Is there any way to form the compound words correctly?
A very small fix should do the job:
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token          # punctuation: no space
            else:
                res += ' ' + token    # regular word: add a space
    return res
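For example:

print(join_tokens(["S", ".", "E", ".", "C"]))   # S.E.C
print(join_tokens(["Exchange", "Commission"]))  # Exchange Commission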
def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            # Flush any open entity before handling the non-matching tag
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Sort and de-duplicate the collected entities
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
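With join_tokens in place, the S.E.C input from the question now collapses as intended:

print(collapse([("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"), (".", "I-ORG"), ("C", "I-ORG")]))
# [['S.E.C', 'ORG']]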
Update
This solves most cases, but as the comments below show, there will always be outliers. The complete solution is therefore to track the identity of the word that produced each token. Thus:
text="U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(w)]
# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commision",4)]
Now, given a token's index, you know exactly which word it came from; simply concatenate tokens that belong to the same word, and add a space when adjacent tokens come from different words. The NER result would then look like this:
[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]