Looping over unique entries

Question

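I have some labeled entities along with their text, and I'm trying to get them into spaCy so I can use them to train an NER model. I'm having trouble writing a for loop that puts entities from the same text into the same entry.

Example data: (df)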

Text                              start     end     ent
Sara and Sam went to the park     0         4       Person
Sara and Sam went to the park     9         12      Person
Jake played on the swings         0         4       Person
The dog played with Tom           20        23      Person

My attempt was:

TRIAN = []
ENTS = []
for i in len(np.unique(df['Text'])[i]):
    text = df['Text'][i]
    for ii in range(len(df[df['Text'] == np.unique(df['Text'])[i]]]):
        Ent = [(df['start'][i + ii],[df['end'][i + ii],df['ent'][i + ii])]
        ENTS.append(Ent[i + ii])
        Results = [text[i], {'entities': ENTS.append(Ent[i + ii])}]
        TRAIN.append(Results)
print(TRAIN)

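The desired output is: [["Sara and Sam went to the park", {"entities": [[0,4,"Person"],[9,12,"Person"]]}], ["Jake played on the swings", {"entities": [[0,4,"Person"]]}], ["The dog played with Tom", {"entities": [[20,23,"Person"]]}]]

Any suggestions on how to fix my code to produce the desired output would be greatly appreciated.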

Answer 1

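The way your data is formatted is a bit odd, which makes it a little awkward to work with. You can do something like this. (I'm going to leave out the DataFrame handling since it's not relevant.)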

docs = []
ents = []
old = None # prior sentence
# note: this assumes rows with the same text are next to each other
for row in data:
    text, start, end, label = ... # split it somehow
    if text != old:
        # new doc: save the previous one, then reset the ent buffer
        if old is not None:
            docs.append( [old, ents] )
        ents = []
        old = text
    ents.append( (start, end, label) )
# clean up after the loop (skipped if data was empty)
if old is not None:
    docs.append( [old, ents] )
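
If you'd rather stay in pandas, here is a minimal sketch of the same grouping idea using `groupby` instead of the manual loop above. This is a swapped-in alternative, not the loop from this answer. It assumes the column names from the question (`Text`, `start`, `end`, `ent`) and rebuilds the sample DataFrame by hand so it runs on its own.

```python
import pandas as pd

# Sample data copied from the question (rebuilt by hand for this sketch).
df = pd.DataFrame({
    "Text": [
        "Sara and Sam went to the park",
        "Sara and Sam went to the park",
        "Jake played on the swings",
        "The dog played with Tom",
    ],
    "start": [0, 9, 0, 20],
    "end": [4, 12, 4, 23],
    "ent": ["Person", "Person", "Person", "Person"],
})

TRAIN = []
# sort=False keeps the texts in order of first appearance.
for text, group in df.groupby("Text", sort=False):
    # Collect every (start, end, label) row that belongs to this text.
    ents = [
        [int(start), int(end), label]
        for start, end, label in zip(group["start"], group["end"], group["ent"])
    ]
    TRAIN.append([text, {"entities": ents}])

print(TRAIN)
```

Unlike the loop, this does not depend on rows with the same text being adjacent, since `groupby` collects them wherever they are in the frame.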

python

named-entity-recognition

spacy