Biospans from String and Dictionary of Start and End Tokens for NLP Preprocessing

Question

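Suppose I have a simple sentence and a dictionary with two lists, start and end, where start holds the starting token index and end holds the ending token index of each BIO span. I want to create BIO tags for the sentence, where B means beginning, I means inside, and O means outside. This is a very common concept in data preprocessing for NLP. How can I do this?

For example, suppose the input sentence is "I like to play soccer while he likes to run" and the token dictionary is {'start': [0, 6], 'end': [3, 9]}. The expected output is then B I I I O O B I I I

You can assume the spans do not overlap.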

Answer 1

This function does exactly that:

def spanbio(sent, toks):
    """
    Format:
    sent - Sentence - Simple String
    toks - Spans as Dictionary of inclusive token indices
           Ex. {'start': [0, 13, 22], 'end': [4, 19, 27]}
    """
    # Every whitespace-separated token starts out as 'O' (outside any span).
    ls = ['O'] * len(sent.split())
    for start, end in zip(toks['start'], toks['end']):
        # The first token of a span is 'B'; the rest, up to and including
        # end, are 'I'. Tagging spans directly keeps two adjacent spans
        # from being merged into one.
        ls[start] = 'B'
        for k in range(start + 1, end + 1):
            ls[k] = 'I'
    return ' '.join(ls)

Driver code to run it:

sent = "I like to play soccer while he likes to run"
toks = {'start': [0, 6], 'end': [3, 9]}
print(spanbio(sent, toks))

Output:

B I I I O O B I I I
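
One way to sanity-check the result is to go the other way and recover the spans from the tag string. The following is only a minimal sketch and not part of the original answer: bio_to_spans is a hypothetical helper name, and it assumes whitespace-separated tags, inclusive end indices as in the question, and well-formed input in which every span begins with B.

```python
def bio_to_spans(tags):
    """Recover {'start': [...], 'end': [...]} (inclusive token indices) from BIO tags.

    Hypothetical helper for illustration only; assumes well-formed tags
    in which every span begins with 'B'.
    """
    starts, ends = [], []
    for i, tag in enumerate(tags.split()):
        if tag == 'B':
            # A new span opens here; it is at least one token long.
            starts.append(i)
            ends.append(i)
        elif tag == 'I' and ends:
            # Extend the most recently opened span to this token.
            ends[-1] = i
    return {'start': starts, 'end': ends}


print(bio_to_spans("B I I I O O B I I I"))
# {'start': [0, 6], 'end': [3, 9]}
```

Feeding the output of the example above through this helper gives back the original dictionary, which is a quick round-trip check that the start and end indices were interpreted as intended.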


python

nlp

data-processing