How to tag named entities to prepare training data for custom named entity recognition with spaCy?
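I want to train the spaCy named entity recognizer on my custom dataset. I have prepared a Python dictionary with key = entity_type and value = a list of entity names, but I have not found a way to tag the tokens in the correct format.
I tried plain string matching (find) and regular expressions (search, compile), but did not get what I wanted.
For example, here are a sentence and the dictionary I am using (this is just an example):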
sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}
# Lowercase the sentence so the lowercase dictionary values can match
text = sentence.lower()
for k, v in dic.items():
    for val in v:
        if val in text:
            # right now I'm just printing the key, val and starting index
            print(k, val, text.index(val))
Output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21
Expected output: MLDM 0 32
so that I can then prepare training data to train spaCy NER:
[{"content":"machine learning and data mining often employ the same methods and overlap significantly.","entities":[[0,32,"MLDM"]]}]
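You can build a regular expression from all the values in dic, match them as whole words, and, on each match, look up the key associated with the matched value. I assume that the values are unique across the dictionary, that they may contain whitespace, and that they consist only of "word" characters (no special characters such as + or ().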
import re
sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}

def get_key(val):
    # Return the key whose value list contains val (compared case insensitively)
    for k, v in dic.items():
        if val.lower() in map(str.lower, v):
            return k
    return ''

# Flatten the lists in values and sort the list by length in descending order
values = sorted([v for x in dic.values() for v in x], key=len, reverse=True)
# Build the alternation based regex with \b to match each item as a whole word
rx = r'\b(?:{})\b'.format("|".join(values))
for m in re.finditer(rx, sentence, re.I):  # Search case insensitively
    key = get_key(m.group())
    if key:
        print("{} {}".format(key, m.start()))
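If you also need the end offset and the training-data layout shown in the question, the same idea can be wrapped in a small helper. This is only a minimal sketch: the function name build_training_example and its parameters are hypothetical, the "content"/"entities" layout is copied from the question rather than from any particular spaCy version, and the values are still assumed to be unique across keys. Unlike the snippet above, it escapes each value with re.escape and uses lookarounds instead of \b, so values containing characters such as + or ( should not break the pattern.

```python
import re

def build_training_example(text, label_to_phrases):
    """Return {"content": text, "entities": [[start, end, label], ...]}."""
    # Map each lowercased phrase to its label (assumes phrases are unique across labels).
    phrase_to_label = {
        phrase.lower(): label
        for label, phrases in label_to_phrases.items()
        for phrase in phrases
    }
    # Longest phrases first, so the alternation tries the longest candidate first.
    ordered = sorted(phrase_to_label, key=len, reverse=True)
    # Lookarounds require that no word character touches the match on either side.
    pattern = re.compile(
        r"(?<!\w)(?:{})(?!\w)".format("|".join(re.escape(p) for p in ordered)),
        re.IGNORECASE,
    )
    # finditer yields non-overlapping matches, so nested phrases are not tagged twice.
    entities = [
        [m.start(), m.end(), phrase_to_label[m.group().lower()]]
        for m in pattern.finditer(text)
    ]
    return {"content": text, "entities": entities}

sentence = ("Machine learning and data mining often employ the same methods "
            "and overlap significantly.")
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}
example = build_training_example(sentence, dic)
print(example["entities"])
```

For the sentence in the question this prints [[0, 32, 'MLDM']]: the longest value wins at position 0, and the shorter ML and DM values inside it are skipped because matches cannot overlap.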