Python: Find the start and end index of a sub-text column within another text column
I'm working on question answering and have to build my own dataset. I have 5 columns:
question | context | answer | answer_start | answer_end
Each record in the context column holds a passage of text, for example:
Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.
The corresponding answer contains a string extracted from context, for example:
the first person to walk on the Moon
I need to fill in answer_start and answer_end, which are the starting and ending indices of the answer text within context. In the example above, answer_start would be 114 and answer_end would be 150. Both columns are currently empty.
I tried the following:
df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())
But it throws an error:
TypeError: 'int' object is not subscriptable
Is there a way to solve this? Ideally one that doesn't need a loop?
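You should use df.apply rather than df['answer_start'].apply. Because you are calling apply on a Series, the x your lambda receives is not a row but a single integer from that column, which is why x['answer'] fails. With df.apply(..., axis=1) the lambda receives each row instead.
Try: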
import pandas as pd

df = pd.DataFrame({'context': ['The cat sat on the mat', 'Around the world we go'],
                   'answer': ['mat', 'world']})
df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer'].str.len() + df['answer_start']
print(df)

                  context answer  answer_start  answer_end
0  The cat sat on the mat    mat            19          22
1  Around the world we go  world            11          16
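To make the difference between the two apply calls concrete, here is a small sketch. The answer_start column is filled with placeholder zeros purely for illustration (an assumption; the question only says the column is empty), and the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "context": ["The cat sat on the mat", "Around the world we go"],
    "answer": ["mat", "world"],
    "answer_start": [0, 0],  # placeholder integers standing in for the unfilled column
})

# Series.apply hands the function one cell value at a time (a plain number here),
# so there is nothing to index with ['answer'].
cell_types = df["answer_start"].apply(lambda x: type(x).__name__).tolist()

# DataFrame.apply with axis=1 hands the function a whole row as a Series,
# so other columns can be looked up by name.
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

print(cell_types)
print(row_types)
```

The second list should show that each row arrives as a Series, which is what makes x['context'] and x['answer'] work.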
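Try:
df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer_start'] + df['answer'].str.len()
>>> df[['answer_start', 'answer_end']]
   answer_start  answer_end
0           113         149
Python string indices are zero-based, so on the Armstrong example these come out one lower than the 114 and 150 given in the question.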
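One thing to watch with str.find: it returns -1 when the answer is not in the context, and that -1 would silently flow into answer_end. Below is a rough sketch of one way to handle that, using zip over the two columns instead of a row-wise apply. The "moon" row and the choice to store misses as missing values in a nullable Int64 column are my assumptions, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "context": ["The cat sat on the mat", "Around the world we go"],
    "answer": ["mat", "moon"],  # "moon" does not occur in its context
})

# Pair each context with its answer; str.find gives -1 when there is no match.
starts = [c.find(a) for c, a in zip(df["context"], df["answer"])]

# Turn the -1 sentinel into a missing value so it cannot be mistaken for an index.
starts = [s if s >= 0 else None for s in starts]

# The nullable Int64 dtype keeps found positions as integers next to missing ones.
df["answer_start"] = pd.array(starts, dtype="Int64")
df["answer_end"] = df["answer_start"] + df["answer"].str.len()

print(df)
```

This still iterates in Python under the hood, as apply with axis=1 does, but it avoids constructing a Series object for every row.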
You can compute the indices from the context and answer fields. To use more than one field at a time, you should use df.apply. For example, I created a toy dataset:
import pandas as pd
text = "this is text number"
df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})
The data looks like this:
                       A       B
0  this is text number 1    this
1  this is text number 2      is
2  this is text number 3    text
3  this is text number 4  number
Now we can compute the start and end index values:
df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df.apply(lambda row: row["start"] + len(row["B"]), axis=1)
Here is the result:
                       A       B  start  end
0  this is text number 1    this      0    4
1  this is text number 2      is      2    4
2  this is text number 3    text      8   12
3  this is text number 4  number     13   19
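If you would rather keep re.search from the original attempt, note that the answer string is then treated as a regular expression, so characters such as parentheses change its meaning. A minimal sketch of a row-wise version that escapes the answer first is below; find_span and the sample row are hypothetical names and data made up for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "context": ["Neil Armstrong (1930 - 2012) was an astronaut"],
    "answer": ["(1930 - 2012)"],  # parentheses are regex metacharacters
})


def find_span(row):
    """Return the start and end of row['answer'] inside row['context']."""
    # re.escape makes the answer match literally rather than as a pattern.
    match = re.search(re.escape(row["answer"]), row["context"])
    if match is None:
        return pd.Series({"answer_start": pd.NA, "answer_end": pd.NA})
    return pd.Series({"answer_start": match.start(), "answer_end": match.end()})


# axis=1 passes each row to find_span; the returned Series become two columns.
df[["answer_start", "answer_end"]] = df.apply(find_span, axis=1)

print(df[["answer_start", "answer_end"]])
```

For plain substring lookups str.find is simpler; the regex route mainly pays off if you later need case-insensitive or whitespace-tolerant matching.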