Label custom NER in pandas dataframe
I have a dataframe with 3 columns, 'text', 'in' and 'tar', of type (str, list, list) respectively.
text in tar
0 This is an example text that I use in order to ... [2] [6]
1 Discussion: We are examining the possibility of ... [3] [6, 7]
in and tar represent the specific entities that I want to tag in the text; each holds the positions in the text of the entity terms that were found.
For example, in the 2nd row of the dataframe, where in = [3], I want to take the 3rd word from the text column (i.e. "are") and tag it as <IN>are</IN>.
Similarly, for the same row, since tar = [6, 7], I also want to take the 6th and 7th words from the text column (i.e. "possibility", "of") and tag them as <TAR>possibility</TAR>, <TAR>of</TAR>.
Can someone help me with how to do this?
This is not the most optimal implementation, but it may serve as inspiration.
import pandas as pd

data = {'text': ['This is an example text that I use in order to',
                 'Discussion: We are examining the possibility of the'],
        'in': [[2], [3]],
        'tar': [[6], [6, 7]]}
df = pd.DataFrame(data)

# Every column after 'text' holds the word positions for one tag
cols = list(df.columns)[1:]
new_text = []
for idx, row in df.iterrows():
    temp = row['text'].split()
    # Note: positions are matched against 0-based word indices here
    for pos, word in enumerate(temp):
        for col in cols:
            if pos in row[col]:
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text
print(df.text.to_list())
Output:
['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to',
'Discussion: We are <IN>examining</IN> the possibility <TAR>of</TAR> <TAR>the</TAR>']
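Note that the code above matches positions against 0-based word indices, while the question counts words starting from 1 (it expects in = [3] to select "are"). Here is a minimal sketch of a 1-based variant. The helper name tag_words and its offset parameter are made up for illustration, and a word listed under both tags would end up wrapped twice.

```python
import pandas as pd

def tag_words(text, positions_by_tag, offset=1):
    """Wrap the words at the given positions in <TAG>...</TAG> markers.

    positions_by_tag maps a tag name (e.g. 'in') to a list of word positions.
    offset=1 treats positions as 1-based, as in the question's example.
    """
    words = text.split()
    for tag, positions in positions_by_tag.items():
        label = tag.upper()
        for pos in positions:
            idx = pos - offset
            if 0 <= idx < len(words):  # skip positions outside the text
                words[idx] = f'<{label}>{words[idx]}</{label}>'
    return ' '.join(words)

df = pd.DataFrame({
    'text': ['This is an example text that I use in order to',
             'Discussion: We are examining the possibility of the'],
    'in': [[2], [3]],
    'tar': [[6], [6, 7]],
})

# Build the tagged text row by row; the original columns are left untouched
tagged = df.apply(
    lambda row: tag_words(row['text'], {'in': row['in'], 'tar': row['tar']}),
    axis=1,
)
print(tagged.to_list())
```

With the question's data, the second row comes out as 'Discussion: We <IN>are</IN> examining the <TAR>possibility</TAR> <TAR>of</TAR> the', which is what the question asks for. Passing offset=0 gives the same behavior as the loop above.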
Update 1
Consecutive occurrences of the same tag can be merged like this:
import pandas as pd

data = {'text': ['This is an example text that I use in order to',
                 'Discussion: We are examining the possibility of the'],
        'in': [[2], [3, 4, 5]],
        'tar': [[6], [6, 7]]}
df = pd.DataFrame(data)

# Every column after 'text' holds the word positions for one tag
cols = list(df.columns)[1:]
new_text = []
for idx, row in df.iterrows():
    temp = row['text'].split()
    for pos, word in enumerate(temp):
        for col in cols:
            if pos in row[col]:
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text

# Merge adjacent tags of the same kind: "</IN> <IN>" collapses to a single space
for col in cols:
    tag = col.upper()
    df['text'] = df['text'].str.replace(f'</{tag}> <{tag}>', ' ', regex=False)
print(df.text.to_list())
Output:
['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to', 'Discussion: We are <IN>examining the possibility</IN> <TAR>of the</TAR>']
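As an alternative to merging tags with a string replace afterwards, the runs of consecutive positions can be detected first, so that each run gets a single opening and closing tag. Here is a rough sketch in plain Python. The function name tag_spans is hypothetical, and positions are assumed to be 0-based so the result lines up with the output above.

```python
def tag_spans(text, positions, label):
    """Wrap each run of consecutive word positions in one <LABEL>...</LABEL> pair."""
    words = text.split()
    # Keep only valid positions, sorted and without duplicates
    wanted = sorted({p for p in positions if 0 <= p < len(words)})
    for i, pos in enumerate(wanted):
        starts_run = i == 0 or wanted[i - 1] != pos - 1
        ends_run = i == len(wanted) - 1 or wanted[i + 1] != pos + 1
        if starts_run:
            words[pos] = f'<{label}>' + words[pos]
        if ends_run:
            words[pos] = words[pos] + f'</{label}>'
    return ' '.join(words)

text = 'Discussion: We are examining the possibility of the'
result = tag_spans(text, [3, 4, 5], 'IN')
# Tags are attached to words without spaces, so word positions stay the same
result = tag_spans(result, [6, 7], 'TAR')
print(result)
```

This prints 'Discussion: We are <IN>examining the possibility</IN> <TAR>of the</TAR>'. Because the runs are found from the positions themselves, the result does not depend on the exact whitespace between the closing and opening tags.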