Looping over unique entries
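I have some labeled entities along with their text, and I'm trying to get them into spaCy to train an NER model. I'm having trouble writing a for loop that puts the entities from the same text into the same entry.
Sample data (df):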
Text                           start  end  ent
Sara and Sam went to the park  0      4    Person
Sara and Sam went to the park  9      12   Person
Jake played on the swings      0      4    Person
The dog played with Tom        20     23   Person
My attempt:
    TRIAN = []
    ENTS = []
    for i in len(np.unique(df['Text'])[i]):
        text = df['Text'][i]
        for ii in range(len(df[df['Text'] == np.unique(df['Text'])[i]]]):
            Ent = [(df['start'][i + ii],[df['end'][i + ii],df['ent'][i + ii])]
            ENTS.append(Ent[i + ii])
        Results = [text[i], {'entities': ENTS.append(Ent[i + ii])}]
        TRAIN.append(Results)
    print(TRAIN)
The desired output is:
    [["Sara and Sam went to the park", {"entities": [[0,4,"Person"],[9,12,"Person"]]}], ["Jake played on the swings", {"entities": [[0,4,"Person"]]}], ["The dog played with Tom", {"entities": [[20,23,"Person"]]}]]
Any advice on how to fix my code to produce the desired output would be greatly appreciated.
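The way your data is formatted is a little odd, so it's a bit awkward to work with. You can do something like this. (I'm leaving out the dataframe manipulation because it isn't relevant.)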
    docs = []
    ents = []
    old = None  # prior sentence
    for row in data:
        text, start, end, label = ...  # split it somehow
        if text != old:
            # new doc, reset the ent buffer
            if old is not None:
                docs.append( [old, ents] )
            ents = []
            old = text
        ents.append( (start, end, label) )
    # clean up after the loop
    docs.append( [text, ents] )
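
For reference, here is a minimal runnable sketch of the same idea using `itertools.groupby` from the standard library. It is not the only way to do this. It assumes that rows for the same text sit next to each other, as in the sample data, and that each row is a `(text, start, end, label)` tuple. The names `rows` and `build_train` are made up for illustration. With a DataFrame whose columns are in that order, `df.itertuples(index=False, name=None)` should yield tuples of that shape.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the DataFrame rows: (text, start, end, label).
rows = [
    ("Sara and Sam went to the park", 0, 4, "Person"),
    ("Sara and Sam went to the park", 9, 12, "Person"),
    ("Jake played on the swings", 0, 4, "Person"),
    ("The dog played with Tom", 20, 23, "Person"),
]


def build_train(rows):
    """Collect consecutive rows that share the same text into one entry."""
    train = []
    # groupby only merges adjacent rows with an equal key (the text),
    # which mirrors the "old sentence" check in the loop above.
    for text, group in groupby(rows, key=itemgetter(0)):
        ents = [[start, end, label] for _, start, end, label in group]
        train.append([text, {"entities": ents}])
    return train


TRAIN = build_train(rows)
print(TRAIN)
```

If the same text can show up in non-adjacent rows, sort the rows by text first, or collect the entities in a dict keyed by text instead.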