Find index of word in sentence with information from phrase

Question

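I need the index of word within sentence, but the same word sometimes occurs more than once. The information in phrase would help, as would the previous or next row of the word column.

Basically, I just need to identify which occurrence the word is: for example, if word is 'seaside', I want to know which 'seaside' in the sentence it refers to. The extra information in phrase can help me identify it, and so can the order in which the rows appear in the dataframe.

What I have now is: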

file_id	phrase	word	sentence	word_indices
A	I am	I	I am a happy bird. I sing every day. I eat worms.	[0, 5, 9]
B	the seaside is	the	she is by the seaside. The seaside is packed.	[3, 5]
B	the seaside is	seaside	she is by the seaside. The seaside is packed.	[4, 6]
B	the seaside is	is	she is by the seaside. The seaside is packed.	[1, 7]
C	nobody knows	nobody	nobody knows what is going on. She can find nobody	[0, 9]
C	find nobody	nobody	nobody knows what is going on. She can find nobody	[0, 9]
D	it is such a sunny day	sunny	it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best	[4, 13, 16]

But what I want is the content of the target column:

file_id	phrase	word	sentence	word_indices	target
A	I am	I	I am a happy bird. I sing every day. I eat worms.	[0, 5, 9]	[0]
B	the seaside is	the	she is by the seaside. The seaside is packed.	[3, 5]	[5]
B	the seaside is	seaside	she is by the seaside. The seaside is packed.	[4, 6]	[6]
B	the seaside is	is	she is by the seaside. The seaside is packed.	[1, 7]	[7]
C	nobody knows	nobody	nobody knows what is going on. She can find nobody	[0, 9]	[0]
C	find nobody	nobody	nobody knows what is going on. She can find nobody	[0, 9]	[9]
D	it is such a sunny day	sunny	it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best	[4, 13, 16]	[4]

I found a similar question, but unfortunately its answers are in Java, and I need to use Python.

Thanks a lot!

Answer 1

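I would break this into two steps: count the words in the sentence that come before the phrase, then find the position of the word within the phrase. The sum of the two is the word's index in the sentence: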

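def get_index_of_word_in_sentence(word, phrase, sentence):
    # Compare in lowercase so that "the seaside is" also matches "The seaside is".
    word, phrase, sentence = word.lower(), phrase.lower(), sentence.lower()
    # Character offset of the first occurrence of the phrase (raises ValueError if absent).
    index1 = sentence.index(phrase)
    # Number of whitespace-separated words before the phrase.
    word_num1 = len(sentence[:index1].split())
    # Position of the word inside the phrase.
    word_num2 = phrase.split().index(word)
    return word_num1 + word_num2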

df["target"] = df.apply(lambda x: get_index_of_word_in_sentence(x["word"], x["phrase"], x["sentence"]), axis=1)
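The substring search above works on characters, so it can also match in the middle of a longer word, and it returns a bare integer rather than a list like the target column. As a possible variant, here is a minimal sketch that compares whole tokens instead. The helper names `normalize` and `find_word_indices` are hypothetical, and it assumes the indices in word_indices come from splitting the sentence on whitespace, which is consistent with the rows shown in the question.

```python
import string


def normalize(text):
    # Split on whitespace, then strip surrounding punctuation and lowercase,
    # so "seaside." and "The" compare equal to "seaside" and "the".
    # Token positions stay identical to those of text.split().
    return [tok.strip(string.punctuation).lower() for tok in text.split()]


def find_word_indices(word, phrase, sentence):
    """Return the sentence index of `word` for every place where
    `phrase` occurs as a run of whole tokens in `sentence`."""
    sent_tokens = normalize(sentence)
    phrase_tokens = normalize(phrase)
    # Position of the word inside the phrase (first occurrence only).
    offset = phrase_tokens.index(normalize(word)[0])
    n = len(phrase_tokens)
    return [
        start + offset
        for start in range(len(sent_tokens) - n + 1)
        if sent_tokens[start:start + n] == phrase_tokens
    ]


if __name__ == "__main__":
    sentence = "she is by the seaside. The seaside is packed."
    print(find_word_indices("seaside", "the seaside is", sentence))
```

With pandas it can be applied row by row in the same way as above, e.g. `df.apply(lambda x: find_word_indices(x["word"], x["phrase"], x["sentence"]), axis=1)`. If the phrase itself occurs more than once in a sentence, the returned list has more than one entry, and the row order in the dataframe would still be needed to pick between them.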

Tags: python, indexing, nlp, nltk, pandas