NER combining BIO tokens to form original compound word

Is there any method to combine BIO-tagged tokens into compound words? I implemented the method below to form words from the BIO scheme, but it does not work well for words containing punctuation. For example, S.E.C is joined as S . E . C by the function below.

import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []


    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just append the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there were some entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])
        collapsed_result = sorted(collapsed_result)
        collapsed_result = list(k for k,_ in itertools.groupby(collapsed_result))


    return collapsed_result
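The problem can be reproduced in isolation: the token buffer is joined with a plain `" ".join`, which puts a space around every punctuation token:

```python
tokens = ["S", ".", "E", ".", "C"]

# " ".join treats punctuation like any other token, so the
# entity comes back with spurious spaces around each period.
print(" ".join(tokens))  # S . E . C
```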

Another approach:

I tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:

Original sentence -> parties. \n \n IN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto

Another example:

Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be

Note that the period and the newline characters are removed when using TreebankWordDetokenizer.

Is there any way to form the compound words?

A very minor fix should do the job:

import itertools

def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation
            else:
                res += ' ' + token  # regular word
    return res

def collapse(ner_result):
    # List with the result
    collapsed_result = []


    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just append the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there were some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
        collapsed_result = sorted(collapsed_result)
        collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
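A few spot checks of the joining rule (the function is repeated here so the snippet is self-contained); the last case is exactly the kind of outlier the update below addresses:

```python
def join_tokens(tokens):
    # Insert a space only between two alphabetic tokens;
    # glue punctuation directly onto its neighbour.
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token        # punctuation: no space
            else:
                res += ' ' + token  # regular word: add a space
    return res

print(join_tokens(["S", ".", "E", ".", "C"]))           # S.E.C
print(join_tokens(["Group", "'", "s", "employment"]))   # Group's employment
print(join_tokens(["U", ".", "S", ".", "Securities"]))  # U.S.Securities (missing space)
```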

Update

This will handle most cases, but as can be seen from the comments below, there will always be outliers. So the complete solution is to keep track of the identity of the word that produced each token. Thus:

text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]

Now, given a token's index, you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space when tokens belong to different words. So the NER result would be something like:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
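With the word index attached, the join becomes trivial: concatenate tokens that share an index, and space-separate tokens from different words. A minimal sketch (the helper name `join_by_word` is mine, not from any library):

```python
def join_by_word(tagged_tokens):
    # tagged_tokens: [token, tag, word_ix] triples for one entity
    words = []
    last_ix = None
    for token, _tag, ix in tagged_tokens:
        if ix == last_ix:
            words[-1] += token   # same source word: concatenate directly
        else:
            words.append(token)  # new source word: start a new chunk
        last_ix = ix
    return " ".join(words)

ner = [["U", "B-ORG", 0], [".", "I-ORG", 0], ["S", "I-ORG", 0], [".", "I-ORG", 0],
       ["Securities", "I-ORG", 1], ["and", "I-ORG", 2],
       ["Exchange", "I-ORG", 3], ["Commission", "I-ORG", 4]]
print(join_by_word(ner))  # U.S. Securities and Exchange Commission
```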