Recreate multi-token strings from tokens given indices and text source

Question

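I am preparing a script that reconstructs multi-token strings from tokenized text, for tokens that carry specific tags. Each token is associated with its start and end index in the original text.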

Here is an example text:

t = "Breakfast at Tiffany's is a novella by Truman Capote."

The tokens data structure, containing the original-text indices and the tags:

[(['Breakfast', 0, 9], 'BOOK'),
 (['at', 10, 12], 'BOOK'),
 (['Tiffany', 13, 20], 'BOOK'),
 (["'", 20, 21], 'BOOK'),
 (['s', 21, 22], 'BOOK'),
 (['is', 23, 25], 'O'),
 (['a', 26, 27], 'O'),
 (['novella', 28, 35], 'O'),
 (['by', 36, 38], 'O'),
 (['Truman', 39, 45], 'PER'),
 (['Capote', 46, 52], 'PER'),
 (['.', 52, 53], 'O')]

This data structure was generated from t as follows:

import re

tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t, re.UNICODE)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))

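What I want the script to do is iterate over token_tuples and, when it encounters a token whose tag is not O, break out of the main iteration and reconstruct the tagged multi-token span, continuing until it reaches the nearest token tagged O.

Here is the current script: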

for i in range(len(token_tuples)):

    if token_tuples[i][1] != 'O':

        tag = token_tuples[i][1]
        start_ix = token_tuples[i][0][1]

        slider = i+1

        while slider < len(token_tuples):

            if tag != token_tuples[slider][1]:

                end_ix = token_tuples[slider][0][2]

                print((t[start_ix:end_ix], tag))
                break

            else:
                slider+=1

This prints:

("Breakfast at Tiffany's is", 'BOOK')
("at Tiffany's is", 'BOOK')
("Tiffany's is", 'BOOK')
("'s is", 'BOOK')
('s is', 'BOOK')
('Truman Capote.', 'PER')
('Capote.', 'PER')

What needs to be modified so that the output for this example is:

> ("Breakfast at Tiffany's", "BOOK")
> ("Truman Capote", "PER")

Answer 1

Here is one solution. If you can come up with something less verbose, I'd be happy to accept your answer!

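def extract_entities(t, token_tuples):

    entities = []
    tag = ''

    for i in range(len(token_tuples)):

        if token_tuples[i][1] != 'O':

            # A new entity starts whenever the tag differs from the current one.
            if token_tuples[i][1] != tag:
                tag = token_tuples[i][1]
                start_ix = token_tuples[i][0][1]

            # The entity ends at the last token, or when the next tag differs.
            # Checking for the last token keeps an entity at the very end of
            # the text from being silently dropped.
            if i+1 == len(token_tuples) or tag != token_tuples[i+1][1]:
                end_ix = token_tuples[i][0][2]
                entities.append((t[start_ix:end_ix], tag))
                tag = ''

    return entities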
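
A possibly less verbose variant, offered as a sketch rather than a definitive implementation: itertools.groupby collects consecutive tokens that share a tag, so each run of non-O tokens becomes one span. The function name extract_entities_grouped is made up for this illustration; t and token_tuples are assumed to be built exactly as in the question.

```python
import re
from itertools import groupby


def extract_entities_grouped(t, token_tuples):
    """Merge each run of adjacent tokens sharing a non-'O' tag into one span."""
    entities = []
    # groupby yields one group per run of consecutive items with the same key,
    # and the key here is the tag (second element of each tuple).
    for tag, group in groupby(token_tuples, key=lambda pair: pair[1]):
        if tag == 'O':
            continue
        run = list(group)
        start_ix = run[0][0][1]   # start offset of the first token in the run
        end_ix = run[-1][0][2]    # end offset of the last token in the run
        entities.append((t[start_ix:end_ix], tag))
    return entities


t = "Breakfast at Tiffany's is a novella by Truman Capote."
tokens = [[m.group(0), m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", t)]
tags = ['BOOK', 'BOOK', 'BOOK', 'BOOK', 'BOOK', 'O', 'O', 'O', 'O', 'PER', 'PER', 'O']
token_tuples = list(zip(tokens, tags))

print(extract_entities_grouped(t, token_tuples))
# [("Breakfast at Tiffany's", 'BOOK'), ('Truman Capote', 'PER')]
```

Because the span is sliced from t using the first start offset and the last end offset, the original spacing and punctuation are preserved. One limitation to keep in mind: two distinct entities with the same tag that sit directly next to each other, with no O token between them, are merged into a single span.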

