Create a NER dictionary from a given text
I have the following variable:
data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
{"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})
data[1]['entities'][0] = (48, 54, 'Category 1')
represents (start_offset, end_offset, entity).
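For example, the first offset pair can be checked directly against the text (a minimal sanity check, reusing the string from data[0]):

```python
text = ("Thousands of demonstrators have marched through London to protest "
        "the war in Iraq and demand the withdrawal of British troops from "
        "that country. Many people have been killed that day.")
start, end, label = (48, 54, 'Category 1')
print(text[start:end])  # London
```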
I want to read every word of data[0] and tag it according to the entities in data[1]. I expect the final output to be:
{
'Thousands': 'O',
'of': 'O',
'demonstrators': 'O',
'have': 'O',
'marched': 'O',
'through': 'O',
'London': 'S-1',
'to': 'O',
'protest': 'O',
'the': 'O',
'war': 'O',
'in': 'O',
'Iraq': 'S-1',
'and': 'O',
'demand': 'O',
'the': 'O',
'withdrawal': 'O',
'of': 'O',
'British': 'S-2',
'troops': 'O',
'from': 'O',
'that': 'O',
'country': 'O',
'.': 'O',
'Many': 'O',
'people': 'S-3',
'have': 'B-3',
'been': 'B-3',
'killed': 'E-3',
'that': 'O',
'day': 'O',
'.': 'O'
}
Here, 'O' stands for 'OutOfEntity', 'S' for 'Start', 'B' for 'Between', and 'E' for 'End'; the category numbers are specific to each given text.
I tried the following approach:
import re

entities = {}
offsets = data[1]['entities']
for entity in offsets:
    # map each entity's surface text to its category number, e.g. 'London' -> '1'
    entities[data[0][entity[0]:entity[1]]] = re.findall('[0-9]+', entity[2])[0]

tags = {}
for key, value in entities.items():
    entity = key.split()
    if len(entity) > 1:
        bEntity = entity[1:-1]
        tags[entity[0]] = 'S-' + value
        tags[entity[-1]] = 'E-' + value
        for item in bEntity:
            tags[item] = 'B-' + value
    else:
        tags[entity[0]] = 'S-' + value
The output is:
{'London': 'S-1',
'Iraq': 'S-1',
'British': 'S-2',
'people': 'S-3',
'killed': 'E-3',
'have': 'B-3',
'been': 'B-3'}
From this point on, I am stuck on how to handle the 'O' entities. I would also like to make the code more efficient and readable. I don't think a dictionary will work well here, because the same word can occur as a key more than once.
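The duplicate-key problem can be demonstrated in two lines (the word 'have' occurs both outside and inside an entity in the example text):

```python
tags = {}
tags['have'] = 'O'    # first occurrence, outside any entity
tags['have'] = 'B-3'  # second occurrence silently overwrites the first
print(tags)           # {'have': 'B-3'} -- the 'O' tag is lost
```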
import re
import nltk

def ner(data):
    entities = {}
    offsets = data[1]['entities']
    for entity in offsets:
        entities[data[0][int(entity[0]):int(entity[1])]] = re.findall('[0-9]+', entity[2])[0]
    tags = []
    for key, value in entities.items():
        entity = key.split()
        if len(entity) > 1:
            bEntity = entity[1:-1]
            tags.append((entity[0], 'S-' + value))
            for item in bEntity:
                tags.append((item, 'B-' + value))
            tags.append((entity[-1], 'E-' + value))
        else:
            tags.append((entity[0], 'S-' + value))
    # every token not already tagged gets 'O'
    tokens = nltk.word_tokenize(data[0])
    tagged_words = [word for word, _ in tags]
    OTokens = [(token, 'O') for token in tokens if token not in tagged_words]
    for token in OTokens:
        tags.append(token)
    return tags
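One way past both problems (emitting the 'O' tags and handling repeated words such as 'have') is to tag by character offsets rather than by word keys. The sketch below is one possible approach, not a definitive fix: it replaces nltk.word_tokenize with a simple re.finditer pattern so that every token carries its own (start, end) span, which is then compared against the entity offsets:

```python
import re

def ner_tags(data):
    """Tag every token as O/S/B/E plus the category number, working from
    character offsets so each occurrence of a repeated word gets its own tag."""
    text, entities = data[0], data[1]['entities']
    tags = []
    # \w+ matches words, [^\w\s] matches single punctuation marks such as '.'
    for match in re.finditer(r'\w+|[^\w\s]', text):
        tag = 'O'
        for start, end, label in entities:
            if start <= match.start() and match.end() <= end:
                num = re.search(r'\d+', label).group()
                if len(text[start:end].split()) == 1:
                    tag = 'S-' + num  # single-word entity
                elif match.start() == start:
                    tag = 'S-' + num  # first word of a multi-word entity
                elif match.end() == end:
                    tag = 'E-' + num  # last word of a multi-word entity
                else:
                    tag = 'B-' + num  # word in between
                break
        tags.append((match.group(), tag))
    return tags
```

A list of (token, tag) pairs also sidesteps the duplicate-key issue: both occurrences of 'have' survive, the first as ('have', 'O') and the second as ('have', 'B-3').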