Get the indices of word sequences in a text represented as a list of words

Given a passage represented as a list of words:

context = ['Katie', 'Joplin', 'is', 'an', 'American', 'sitcom', 'created', 'by', 'Tom', 'Seeley', 'and', 'Norm', 'Gunzenhauser', '.', 'The', 'sitcom', 'received', 'positive', 'reviews', 'thanks', 'to', 'the', 'brilliance', 'of', 'Tom', 'Seeley', '.']

and a list of multi-word target strings:

target = ['sitcom created', 'Tom Seeley']

How can I get the indices of the multi-word targets?

In this case the answer should be:

[[5, 6], [8, 9], [24, 25]]

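If every element of context is a single word (no spaces), you can join the context into one string and then use str.index. The number of spaces before the match gives the index of its first word: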

target = ["sitcom created", "Tom Seeley"]

out, joined = [], " ".join(context)
for t in target:
    try:
        idx = joined.index(t)
        cnt = joined[:idx].count(" ")
        out.append([cnt, cnt + t.count(" ")])
    except ValueError:  # str.index raises ValueError when t is not found
        continue

print(out)

Prints:

[[5, 6], [8, 9]]

Edit: to handle multiple occurrences:

target = ["sitcom created", "Tom Seeley"]

out, joined = [], " ".join(context)
for t in target:
    idx = 0
    while True:
        try:
            idx_new = joined[idx:].index(t)
            cnt = joined[: idx + idx_new].count(" ")
            out.append([cnt, cnt + t.count(" ")])
            idx += idx_new + len(t)
        except ValueError:  # no further occurrence of t
            break

print(out)

Prints:

[[5, 6], [8, 9], [24, 25]]
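
One caveat with the joined-string approach is that it matches characters rather than whole words, so a target like 'is an' would also be found inside 'this and'. A minimal sketch of a token-level alternative follows, assuming targets are split on whitespace; the function name find_phrase_spans is hypothetical and not part of the original answer.

```python
def find_phrase_spans(context, targets):
    """Return [start, end] (inclusive) token indices for every occurrence of each target."""
    spans = []
    for phrase in targets:
        words = phrase.split()
        n = len(words)
        if n == 0:
            continue
        # Slide a window of n tokens over the context and compare whole words.
        for start in range(len(context) - n + 1):
            if context[start:start + n] == words:
                spans.append([start, start + n - 1])
    return spans


context = ['Katie', 'Joplin', 'is', 'an', 'American', 'sitcom', 'created', 'by',
           'Tom', 'Seeley', 'and', 'Norm', 'Gunzenhauser', '.', 'The', 'sitcom',
           'received', 'positive', 'reviews', 'thanks', 'to', 'the', 'brilliance',
           'of', 'Tom', 'Seeley', '.']
target = ['sitcom created', 'Tom Seeley']

print(find_phrase_spans(context, target))
# [[5, 6], [8, 9], [24, 25]]
```

This version also works for targets of more than two words, at the cost of one list slice per candidate position.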

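NumPy does its work in C, though this is probably one of the odder ways to solve the problem (lol), since we usually use it for numerical computation. Convert the list to a NumPy array, then compare using broadcasting. This assumes that every word of each multi-word target occurs in the array.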

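import numpy as np

A = np.array(context)
B = np.array(['sitcom created'.split(), 'Tom Seeley'.split()])
# Compare every target word with every context word; argmax returns the first match of each word
np.argmax(A == B[..., None], axis=2)
# [[5, 6], [8, 9]]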
hits = np.where(A == B[..., None])  # (target index, word index, context position)
pre_x, pre_y = (None, None)
l = []
for x, y, z in np.vstack(hits).T:
    if x == pre_x and y == pre_y:
        l[-1][-1].append(z)
    elif x == pre_x and y != pre_y:
        l[-1].append([z])
    else:
        l.append([[z]])
    pre_x, pre_y = x, y
# [[[5, 15], [6]], [[8, 24], [9, 25]]]

from itertools import product
sum(map(list, map(lambda x: product(*x), l)), [])
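# [(5, 6), (15, 6), (8, 9), (8, 25), (24, 9), (24, 25)]
# Note: the product also pairs non-adjacent positions such as (15, 6) and (8, 25);
# only tuples of consecutive indices are real matches.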
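
The product above includes pairs such as (15, 6) whose positions are not adjacent in the text. One possible way to get only contiguous matches while staying in NumPy is to compare sliding windows of the context against each target. This is a sketch of a different technique from the one above, not the original answer's method: it assumes NumPy 1.20 or later (for sliding_window_view), and the function name phrase_spans_np is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def phrase_spans_np(context, targets):
    """Return [start, end] (inclusive) indices of contiguous matches for each target."""
    A = np.array(context)
    spans = []
    for phrase in targets:
        words = np.array(phrase.split())
        n = len(words)
        if n == 0 or n > len(A):
            continue
        windows = sliding_window_view(A, n)  # shape: (len(A) - n + 1, n)
        # A window matches when all of its n words equal the target words.
        starts = np.flatnonzero((windows == words).all(axis=1))
        spans.extend([int(s), int(s) + n - 1] for s in starts)
    return spans


context = ['Katie', 'Joplin', 'is', 'an', 'American', 'sitcom', 'created', 'by',
           'Tom', 'Seeley', 'and', 'Norm', 'Gunzenhauser', '.', 'The', 'sitcom',
           'received', 'positive', 'reviews', 'thanks', 'to', 'the', 'brilliance',
           'of', 'Tom', 'Seeley', '.']
target = ['sitcom created', 'Tom Seeley']

print(phrase_spans_np(context, target))
# [[5, 6], [8, 9], [24, 25]]
```

Unlike the broadcasting version, this does not require the targets to have the same number of words, because each target is compared separately.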