Python: Find starting, ending index of sub-text column from another text column

I'm working on question answering and have to build my own dataset. I have 5 columns:

question | context | answer | answer_start | answer_end

Each record in the context column holds a passage of text, for example:

Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.

The corresponding answer contains a string of text extracted from context, for example:

the first person to walk on the Moon

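I need to populate answer_start and answer_end, which are the starting and ending indices of the answer text within context. In the example above, answer_start would be 114 and answer_end would be 150. Both columns are currently empty.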

I tried the following:

df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())

But it raised an error:

TypeError: 'int' object is not subscriptable

Is there a way to solve this? And is there a way to do it without a loop?

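You should use df.apply (with axis=1) instead of df['answer_start'].apply. Because you are calling apply on a Series, the x you get is not a row but a single cell value, here an integer, which is why x['answer'] fails.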
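To make the difference concrete, here is a minimal sketch that records what each flavor of apply hands to the function. The one-row frame and the placeholder value 0 in answer_start are assumptions made for illustration only; they are not the asker's data.

```python
import pandas as pd

# Toy frame; the integer placeholder in answer_start is assumed for illustration.
df = pd.DataFrame({
    "context": ["The cat sat on the mat"],
    "answer": ["mat"],
    "answer_start": [0],
})

# Series.apply calls the function once per cell, so x is a scalar
# and x['answer'] cannot work.
series_arg_types = df["answer_start"].apply(lambda x: type(x).__name__).tolist()

# DataFrame.apply with axis=1 calls the function once per row, passing
# the row as a Series, so x['answer'] and x['context'] are both available.
row_arg_types = df.apply(lambda x: type(x).__name__, axis=1).tolist()

print(series_arg_types)
print(row_arg_types)
```

The second list contains "Series", which is what makes column lookups by name possible inside the lambda.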

Try:

import pandas as pd

# Small example frame with a context and the answer to locate in it
df = pd.DataFrame({'context':
                   ['The cat sat on the mat', 'Around the world we go'],
                   'answer': ['mat', 'world']})


df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer'].str.len() + df['answer_start']

print(df)

                  context answer  answer_start  answer_end
0  The cat sat on the mat    mat            19          22
1  Around the world we go  world            11          16

Try:

df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer_start'] + df['answer'].str.len()
>>> df[['answer_start', 'answer_end']]
   answer_start  answer_end
0           113         149
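One caveat worth keeping in mind: str.find returns -1 when the substring does not occur, so a row whose answer is missing from its context would silently get a meaningless span. The sketch below is one possible way to handle that, not part of the answer above; the second row ("dog") is a made-up example of a missing answer, and storing the result as nullable Int64 is a design choice assumed here.

```python
import pandas as pd

df = pd.DataFrame({
    "context": ["The cat sat on the mat", "Around the world we go"],
    "answer": ["mat", "dog"],  # "dog" is deliberately absent from its context
})

# Plain-Python find per row; gives -1 where the answer is not present.
starts = pd.Series(
    [c.find(a) for c, a in zip(df["context"], df["answer"])],
    index=df.index,
)
found = starts >= 0

# Keep spans only where the answer was found; missing ones become <NA>.
df["answer_start"] = starts.where(found).astype("Int64")
df["answer_end"] = (starts + df["answer"].str.len()).where(found).astype("Int64")

print(df)
```

Rows with a missing span can then be filtered out with df.dropna(subset=["answer_start"]) before building the dataset.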

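You can use the context and answer fields to compute the indices. To work with more than one field at a time, use df.apply with axis=1. As an example, I created a toy dataset: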

import pandas as pd

text = "this is text number"

df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

The data looks like this:

                       A       B
0  this is text number 1    this
1  this is text number 2      is
2  this is text number 3    text
3  this is text number 4  number

Now we can compute the start and end index values:

df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df.apply(lambda row: row["start"] + len(row["B"]), axis=1)

Here is the result:

                       A       B  start  end
0  this is text number 1    this      0    4
1  this is text number 2      is      2    4
2  this is text number 3    text      8   12
3  this is text number 4  number     13   19
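A quick way to sanity-check spans like these is to slice each text with its computed indices and compare against the substring. The sketch below rebuilds the same toy dataset and does that; the names recovered and all_match are introduced here for illustration. Note that end is exclusive, matching Python slice semantics.

```python
import pandas as pd

text = "this is text number"
df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df["start"] + df["B"].str.len()

# Slice A with [start:end]; each slice should reproduce B exactly.
recovered = df.apply(lambda row: row["A"][row["start"]:row["end"]], axis=1)
all_match = bool((recovered == df["B"]).all())

print(all_match)
```

If all_match is False, the offending rows can be inspected with df[recovered != df["B"]].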