How to read a text and label each word of it in Python

Question

data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})

data[1]['entities'][0] = (48, 54, 'Category 1') stands for (start_offset, end_offset, entity).
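As a quick check of how the offsets work, slicing data[0] with each (start_offset, end_offset) pair should recover the entity text, with the end offset exclusive. This is only an illustrative sketch; the name `spans` is not from the original post.

```python
data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})

# Slice the text with each (start_offset, end_offset) pair; the end is exclusive.
spans = [(data[0][start:end], category)
         for start, end, category in data[1]["entities"]]

for span_text, category in spans:
    print(repr(span_text), category)
```

The last entity covers four words ("people have been killed"), which is why it needs the S/B/E tags described below rather than a single label.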

I want to read each word of data[0] in order and tag each word according to the entities in data[1]. The final output I expect is:

{
'Thousands': 'O', 
'of': 'O',
'demonstrators': 'O',
'have': 'O',
'marched': 'O',
'through': 'O',
'London': 'S-1',
'to': 'O', 
'protest': 'O', 
'the': 'O', 
'war': 'O', 
'in': 'O', 
'Iraq': 'S-1',
'and': 'O',
'demand': 'O', 
'the': 'O', 
'withdrawal': 'O', 
'of': 'O', 
'British': 'S-2', 
'troops': 'O', 
'from': 'O',
'that': 'O', 
'country': 'O',
'.': 'O',
'Many': 'O', 
'people': 'S-3', 
'have': 'B-3', 
'been': 'B-3', 
'killed': 'E-3', 
'that': 'O', 
'day': 'O',
'.': 'O'
}

Here, 'O' stands for 'OutOfEntity', 'S' for 'Start', 'B' for 'Between', and 'E' for 'End', and the labels are unique for each given text.
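The tagging scheme can be sketched as a small helper that tags the words of one entity. The function name `tag_entity_words` is hypothetical, and the sketch assumes the convention of the expected output, where a single-word entity gets only an 'S' tag.

```python
def tag_entity_words(words, label):
    """Return (word, tag) pairs for the words of a single entity."""
    tagged = []
    last = len(words) - 1
    for i, word in enumerate(words):
        if i == 0:
            prefix = "S"   # first (or only) word of the entity
        elif i == last:
            prefix = "E"   # last word of a multi-word entity
        else:
            prefix = "B"   # any word in between
        tagged.append((word, prefix + "-" + label))
    return tagged


print(tag_entity_words(["London"], "1"))
print(tag_entity_words(["people", "have", "been", "killed"], "3"))
```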

I tried the following:

import re
import nltk

def ner(data):
    entities = {}
    offsets = data[1]['entities']
    for entity in offsets:
        entities[data[0][int(entity[0]):int(entity[1])]] = re.findall('[0-9]+', entity[2])[0]
    
    tags = []
    for key, value in entities.items():
        entity = key.split()
        if len(entity) > 1:
            bEntity = entity[1:-1]
            tags.append((entity[0], 'S-'+value))
            for item in bEntity:
                tags.append((item, 'B-'+value))
            tags.append((entity[-1], 'E-'+value))
        else:
            tags.append((entity[0], 'S-'+value))
    
    tokens = nltk.word_tokenize(data[0])
    OTokens = [(token, 'O') for token in tokens if token not in [token[0] for token in tags]]
    for token in OTokens:
        tags.append(token)
    
    return tags

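However, if a word has the same text as a word inside one of the data[1]['entities'] offsets but is not itself covered by any offset, it is dropped instead of being tagged 'O'.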
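The likely cause is that the 'O' filter compares tokens by their text and not by their position. A minimal reproduction, using hand-written lists as stand-ins for the entity words and the tokens (assumed values, not actual output of the function above):

```python
# Words that fall inside some entity span (stand-in values).
entity_words = ["London", "Iraq", "British", "people", "have", "been", "killed"]

# The first few tokens of the sentence; this "have" is outside every entity.
tokens = ["Thousands", "of", "demonstrators", "have", "marched"]

# Same kind of membership test as in the attempt: it filters by text only.
outside = [token for token in tokens if token not in entity_words]
print(outside)
```

The first "have" never reaches the 'O' list because the same word also occurs inside the 'Category 3' span.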

Answer 1

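I'm not sure whether the final format is meant to be JSON, but below is an example that processes the data into the printed format, i.e.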

# sample output
'''
{
'Thousands': 'O',
'of': 'O',
'demonstrators': 'O',
'have': 'O',
'marched': 'O',
'through': 'O',
'London': 'S-1',
'to': 'O',
'protest': 'O',
'the': 'O',
'war': 'O',
'in': 'O',
'Iraq': 'S-1',
'and': 'O',
'demand': 'O',
'the': 'O',
'withdrawal': 'O',
'of': 'O',
'British': 'S-2',
'troops': 'O',
'from': 'O',
'that': 'O',
'country.': 'O',
'Many': 'O',
'people': 'S-3',
'have': 'B-3',
'been': 'B-3',
'killed': 'E-3',
'that': 'O',
'day.': 'O'
}
'''
# sample code
data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})

print("{")
pre = 0  # end offset of the previous entity
for start, end, category in data[1]['entities']:
    # words between the previous entity and this one are outside any entity
    for word in data[0][pre:start].split():
        print("'%s': '%s'," % (word, "O"))
    pre = end
    words = data[0][start:end].split()
    label = category.split()[-1]  # 'Category 1' -> '1'
    for j, word in enumerate(words):
        if j == 0:
            tag = "S"
        elif j == len(words) - 1:
            tag = "E"
        else:
            tag = "B"
        print("'%s': '%s-%s'," % (word, tag, label))
# remaining words after the last entity
tail = data[0][pre:].split()
for j, word in enumerate(tail):
    # no trailing comma after the last item
    sep = "" if j == len(tail) - 1 else ","
    print("'%s': '%s'%s" % (word, "O", sep))
print("}")
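If the tags are needed as data and not as printed text, one possible variant is to return a list of (token, tag) pairs, since a dict cannot hold repeated words such as 'have' or 'that' more than once. The sketch below is not part of the original answer: the function name `label_tokens` is hypothetical, it uses a simple regular expression in place of nltk.word_tokenize, and it assumes that entity offsets fall on token boundaries.

```python
import re

data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})


def label_tokens(text, entities):
    """Return a list of (token, tag) pairs, one per token, in text order."""
    tagged = []
    # \w+ matches a word; [^\w\s] matches a single punctuation character.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, category in entities:
            # The token is tagged only if it lies fully inside the span.
            if start <= match.start() and match.end() <= end:
                label = category.split()[-1]  # 'Category 3' -> '3'
                if match.start() == start:
                    tag = "S-" + label
                elif match.end() == end:
                    tag = "E-" + label
                else:
                    tag = "B-" + label
                break
        tagged.append((match.group(), tag))
    return tagged


result = label_tokens(data[0], data[1]["entities"])
for token, tag in result:
    print(token, tag)
```

Because each token is compared by position and not by text, the first "have" stays 'O' while the one inside the 'Category 3' span becomes 'B-3'. Punctuation is also split into its own token, which matches the output the question asks for.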

Tags: python, text, named-entity-recognition, nltk