BIO spans from a string and a dictionary of start and end tokens for NLP preprocessing
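Suppose I have a simple sentence and a dictionary with two lists, start and end, where start holds the start token index and end holds the end token index of each BIO span. I want to create BIO tags for the sentence, where B means beginning, I means inside, and O means outside. This is a very common step in NLP data preprocessing. How can I do this?
For example, suppose the input sentence is "I like to play soccer while he likes to run"
and the token dictionary is {'start': [0, 6], 'end': [3, 9]}
Then the expected output is B I I I O O B I I I
You can assume the spans do not overlap.
This function does exactly that: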
def spanbio(sent, toks):
    """
    Format:
    sent - Sentence - Simple String
    toks - Spans as Dictionary - Ex. {'start': [0, 13, 22], 'end': [4, 19, 27]}
    Indices are positions in sent.split(); each 'end' index is inclusive.
    """
    samplespanstart = toks['start']
    samplespanend = toks['end']
    # Start with every token tagged as outside any span
    ls = ['O'] * len(sent.split())
    for i, j in zip(samplespanstart, samplespanend):
        # The first token of a span is 'B'; the rest, up to and including j, are 'I'.
        # Tagging 'B' directly keeps two adjacent spans from merging into one.
        ls[i] = 'B'
        for k in range(i + 1, j + 1):
            ls[k] = 'I'
    return ' '.join(ls)
Driver code to run it:
sent = "I like to play soccer while he likes to run"
toks = {'start': [0, 6], 'end': [3, 9]}
print(spanbio(sent, toks))
Output:
B I I I O O B I I I
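If the span dictionary comes from an annotation file rather than being typed by hand, it may be worth checking the spans before tagging. The following is only a sketch under that assumption: span_bio_tags is a hypothetical name, not part of any library, and it returns (token, tag) pairs instead of a joined string so that each tag stays next to its token.

```python
def span_bio_tags(sent, toks):
    """Return (token, tag) pairs; span indices are token positions, end inclusive."""
    tokens = sent.split()
    tags = ['O'] * len(tokens)
    for start, end in zip(toks['start'], toks['end']):
        # Reject spans that are reversed or fall outside the sentence
        if not 0 <= start <= end < len(tokens):
            raise ValueError(f"invalid span ({start}, {end}) for {len(tokens)} tokens")
        # Reject spans that touch a token already tagged by an earlier span
        if any(tag != 'O' for tag in tags[start:end + 1]):
            raise ValueError(f"span ({start}, {end}) overlaps an earlier span")
        tags[start] = 'B'
        tags[start + 1:end + 1] = ['I'] * (end - start)
    return list(zip(tokens, tags))


sent = "I like to play soccer while he likes to run"
toks = {'start': [0, 6], 'end': [3, 9]}
for token, tag in span_bio_tags(sent, toks):
    print(token, tag)
```

Raising ValueError on a bad span is a design choice for this sketch; silently skipping such spans would also work, but it tends to hide annotation errors.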