在一定范围内找到有间隙的子列表
finding gappy sublists within a certain range
最近 我想在更大的列表中查找子列表。我有一个类似但略有不同的问题。假设我有这个列表:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
并且我想使用 matches = ['she', 'is', 'student']
查询它,目的是从查询列表中以相同的顺序获取包含 matches
元素的所有子列表。与 link 中问题的唯一区别是我想向 find_gappy
函数添加一个 range
参数,这样它就不会检索元素之间存在间隙的子列表超出指定范围。例如,在上面的例子中,我想要一个这样的函数:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches, range=2)]
这会 return:
[['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
最后一个元素没有出现,因为在句子 she is a very very exceptionally good student
中,a
和 good
之间的距离超出了范围限制。
这样的函数怎么写呢?差距
链接问题中的技术足够好,您只需要沿途添加间隙计数,并且由于您不想要全局计数,因此在遇到匹配项时重置计数器。类似于:
import collections
def find_gappy(source, matches, max_gap=-1):
matches = collections.deque(matches)
counter = max_gap # initialize as -1 if you want to begin counting AFTER the first match
for word in source:
if word == matches[0]:
counter = max_gap # or remove this for global gap counting
matches.popleft()
if not matches:
return True
else:
counter -= 1
if counter == -1:
return False
return False
data = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
matches = ['she', 'is', 'student']
x = [i for i in data if find_gappy(i, matches, 2)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
作为奖励,您可以将其用作原始函数,仅当您将正数作为 max_gap
传递时才应用间隙计数。
这是一种将 match
列表中的项目顺序考虑在内的方法:
In [102]: def find_gappy(all_sets, matches, gap_range=2):
...: zip_m = list(zip(matches, matches[1:]))
...: for lst in all_sets:
...: indices = {j: i for i, j in enumerate(lst)}
...: try:
...: if all(0 <= indices[j]-indices[i] - 1 <= gap_range for i, j in zip_m):
...: yield lst
...: except KeyError:
...: pass
...:
...:
演示:
In [110]: lst = [['she', 'is', 'a', 'student'],
...: ['student', 'she', 'is', 'a', 'lawer'], # for order check
...: ['she', 'is', 'a', 'great', 'student'],
...: ['i', 'am', 'a', 'teacher'],
...: ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
...:
In [111]:
In [111]: list(find_gappy(lst, ['she', 'is', 'student'], gap_range=2))
Out[111]: [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
如果你的子列表中有重复的单词,你可以使用defaultdict()
来跟踪所有索引并使用itertools.prodcut
来比较所有单词对有序产品的差距。
In [9]: from collections import defaultdict
In [10]: from itertools import product
In [10]: def find_gappy(all_sets, matches, gap_range=2):
...: zip_m = list(zip(matches, matches[1:]))
...: for lst in all_sets:
...: indices = defaultdict(list)
...: for i, j in enumerate(lst):
...: indices[j].append(i)
...: try:
...: if all(any(0 <= v - k - 1 <= gap_range for k, v in product(indices[j], indices[i])) for i, j in zip_m):
...: yield lst
...: except KeyError:
...: pass
最近
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
并且我想使用 matches = ['she', 'is', 'student']
查询它,目的是从查询列表中以相同的顺序获取包含 matches
元素的所有子列表。与 link 中问题的唯一区别是我想向 find_gappy
函数添加一个 range
参数,这样它就不会检索元素之间存在间隙的子列表超出指定范围。例如,在上面的例子中,我想要一个这样的函数:
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches, range=2)]
这会 return:
[['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
最后一个元素没有出现,因为在句子 she is a very very exceptionally good student
中,a
和 good
之间的距离超出了范围限制。
这样的函数怎么写呢?差距
链接问题中的技术足够好,您只需要沿途添加间隙计数,并且由于您不想要全局计数,因此在遇到匹配项时重置计数器。类似于:
import collections
def find_gappy(source, matches, max_gap=-1):
matches = collections.deque(matches)
counter = max_gap # initialize as -1 if you want to begin counting AFTER the first match
for word in source:
if word == matches[0]:
counter = max_gap # or remove this for global gap counting
matches.popleft()
if not matches:
return True
else:
counter -= 1
if counter == -1:
return False
return False
data = [['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
matches = ['she', 'is', 'student']
x = [i for i in data if find_gappy(i, matches, 2)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
作为奖励,您可以将其用作原始函数,仅当您将正数作为 max_gap
传递时才应用间隙计数。
这是一种将 match
列表中的项目顺序考虑在内的方法:
In [102]: def find_gappy(all_sets, matches, gap_range=2):
...: zip_m = list(zip(matches, matches[1:]))
...: for lst in all_sets:
...: indices = {j: i for i, j in enumerate(lst)}
...: try:
...: if all(0 <= indices[j]-indices[i] - 1 <= gap_range for i, j in zip_m):
...: yield lst
...: except KeyError:
...: pass
...:
...:
演示:
In [110]: lst = [['she', 'is', 'a', 'student'],
...: ['student', 'she', 'is', 'a', 'lawer'], # for order check
...: ['she', 'is', 'a', 'great', 'student'],
...: ['i', 'am', 'a', 'teacher'],
...: ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
...:
In [111]:
In [111]: list(find_gappy(lst, ['she', 'is', 'student'], gap_range=2))
Out[111]: [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
如果你的子列表中有重复的单词,你可以使用defaultdict()
来跟踪所有索引并使用itertools.prodcut
来比较所有单词对有序产品的差距。
In [9]: from collections import defaultdict
In [10]: from itertools import product
In [10]: def find_gappy(all_sets, matches, gap_range=2):
...: zip_m = list(zip(matches, matches[1:]))
...: for lst in all_sets:
...: indices = defaultdict(list)
...: for i, j in enumerate(lst):
...: indices[j].append(i)
...: try:
...: if all(any(0 <= v - k - 1 <= gap_range for k, v in product(indices[j], indices[i])) for i, j in zip_m):
...: yield lst
...: except KeyError:
...: pass