Finding the Start and End char indices in Spacy

Question

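I am training a custom model in Spacy to extract custom entities. The training data I supply has to include each entity together with its index positions, so I would like to know whether there is a faster way to assign the index values for the keywords I am looking for in a particular sentence of my training data.

An example of my training data: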

TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]}),  # offsets counted by hand
]

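To pass the index positions of specific keywords in my training data, I am currently counting them by hand.

For example, in the first line, 'Behaviour Skills include Communication ...', I manually count the index position of the word 'Communication', which gives 25, 37.

I am sure there must be a way to identify these index positions other than counting manually. Any ideas on how to achieve this?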

Answer 1

Using str.find() can help here. However, you have to loop through both the sentences and the keywords:

keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
        'Some sentence where lower case conflict resolution is included']

LABEL = 'BS'
TRAIN_DATA = []

for text in texts:
    entities = []
    t_low = text.lower()
    for keyword in keywords:
        k_low = keyword.lower()
        begin = t_low.find(k_low) # index if substring found and -1 otherwise
        if begin != -1:
            end = begin + len(keyword)
            entities.append((begin, end, LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))

Output:

[('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
{'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}), 
('Some sentence where lower case conflict resolution is included', 
{'entities': [(31, 50, 'BS')]})]

I added str.lower() just in case you need the matching to be case-insensitive.
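
One limitation worth noting: str.find() only reports the first occurrence of each keyword, and it also matches inside longer words. If that matters for your data, a possible variation is to use re.finditer() with word boundaries. This is only a minimal sketch; the helper name find_spans and the sample sentence are made up for illustration, and it assumes every keyword starts and ends with a letter or digit so that \b behaves as expected.

```python
import re

def find_spans(text, keywords, label):
    """Return (start, end, label) for every whole-word, case-insensitive match."""
    spans = []
    for keyword in keywords:
        # re.escape keeps any special characters in the keyword literal
        pattern = r"\b" + re.escape(keyword) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
# Hypothetical sentence in which one keyword appears twice
text = 'Communication matters. Good communication helps Conflict Resolution.'
spans = find_spans(text, keywords, 'BS')

# Slicing the text with each span is a quick check that the offsets are right
for start, end, label in spans:
    print(start, end, label, repr(text[start:end]))
```

Slicing text[start:end] for each span is a cheap way to confirm the offsets before training. If two keywords can overlap in the same sentence, the overlapping spans would probably need to be filtered out before the data is used.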

nlp

named-entity-recognition

indices

python-3.x

spacy