Get the indices of word sequences in texts represented as lists of words
Given a paragraph represented as a list of words:
context = ['Katie', 'Joplin', 'is', 'an', 'American', 'sitcom', 'created', 'by', 'Tom', 'Seeley', 'and', 'Norm', 'Gunzenhauser', '.', 'The', 'sitcom', 'received', 'positive', 'reviews', 'thanks', 'to', 'the', 'brilliance', 'of', 'Tom', 'Seeley', '.']
and a list of multi-word target strings:
target = ['sitcom created', 'Tom Seeley']
How can I get the indices of the multi-word targets?
In this case the answer should be:
[[5, 6], [8, 9], [24, 25]]
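If the items in context are single words (no spaces), you can join the context into a single string and then use str.index:
target = ["sitcom created", "Tom Seeley"]
out, joined = [], " ".join(context)
for t in target:
    try:
        idx = joined.index(t)  # character offset of the first match
        cnt = joined[:idx].count(" ")  # spaces before the match = index of its first word
        out.append([cnt, cnt + t.count(" ")])
    except ValueError:  # target does not occur in the text
        continue
print(out)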
Prints:
[[5, 6], [8, 9]]
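Edit: to handle multiple occurrences:
target = ["sitcom created", "Tom Seeley"]
out, joined = [], " ".join(context)
for t in target:
    idx = 0  # character offset from which the next search starts
    while True:
        try:
            idx_new = joined[idx:].index(t)  # offset relative to idx
            cnt = joined[: idx + idx_new].count(" ")
            out.append([cnt, cnt + t.count(" ")])
            idx += idx_new + len(t)  # continue after this match
        except ValueError:  # no further occurrence
            break
print(out)
Prints: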
[[5, 6], [8, 9], [24, 25]]
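Because the joined string is searched character by character, a target could in principle also match inside a longer word (for example "sit" inside "sitcom"). As a hedged alternative, here is a minimal sketch that compares slices of the word list directly, assuming exact, case-sensitive matches are wanted. The helper name find_spans is hypothetical and not part of the answer above.

```python
def find_spans(context, targets):
    """Return [start, end] word indices for every occurrence of each target."""
    spans = []
    for target in targets:
        words = target.split()
        n = len(words)
        if n == 0:
            continue  # skip empty targets
        # slide a window of n words over the context
        for start in range(len(context) - n + 1):
            if context[start:start + n] == words:
                spans.append([start, start + n - 1])
    return spans


context = ['Katie', 'Joplin', 'is', 'an', 'American', 'sitcom', 'created',
           'by', 'Tom', 'Seeley', 'and', 'Norm', 'Gunzenhauser', '.', 'The',
           'sitcom', 'received', 'positive', 'reviews', 'thanks', 'to', 'the',
           'brilliance', 'of', 'Tom', 'Seeley', '.']
print(find_spans(context, ['sitcom created', 'Tom Seeley']))
# [[5, 6], [8, 9], [24, 25]]
```

This also works for targets of more than two words, since the end index is derived from the number of words in the target.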
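NumPy does its work in C, so this should be one of the odder solutions... lol... since we normally use it for numerical computation.
Convert the lists to NumPy arrays, then compare them using broadcasting.
This assumes that every word of each multi-word target occurs in the array.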
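import numpy as np
from itertools import product
A = np.array(context)
B = np.array(['sitcom created'.split(), 'Tom Seeley'.split()])
# first position of each target word (argmax returns the first True along axis 2)
np.argmax(A == B[..., None], axis=2)
# [[5, 6], [8, 9]]
# all positions of each target word, as (target, word, position) triples
idx = np.where(A == B[..., None])
pre_x, pre_y = None, None
l = []
for x, y, z in np.vstack(idx).T:
    if x == pre_x and y == pre_y:
        l[-1][-1].append(z)  # same target, same word: another position
    elif x == pre_x and y != pre_y:
        l[-1].append([z])  # same target, next word
    else:
        l.append([[z]])  # next target
    pre_x, pre_y = x, y
# [[[5, 15], [6]], [[8, 24], [9, 25]]]
# every combination of word positions per target
sum(map(list, map(lambda x: product(*x), l)), [])
# [(5, 6), (15, 6), (8, 9), (8, 25), (24, 9), (24, 25)]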
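The product pairs every position of the first word with every position of the second, so the result also contains pairs such as (15, 6) that are not adjacent in the text. A minimal sketch of one way to keep only the contiguous spans, assuming the grouped positions l have the shape shown in the comment above (plain integers are used here in place of NumPy integers). The helper name consecutive_only is hypothetical.

```python
from itertools import product


def consecutive_only(groups):
    """Keep only the combinations whose indices increase by exactly 1."""
    spans = []
    for per_word_positions in groups:
        for combo in product(*per_word_positions):
            if all(b - a == 1 for a, b in zip(combo, combo[1:])):
                spans.append([int(i) for i in combo])
    return spans


# grouped positions as produced by the loop above
l = [[[5, 15], [6]], [[8, 24], [9, 25]]]
print(consecutive_only(l))
# [[5, 6], [8, 9], [24, 25]]
```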