Python sliding window on a sentence string
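I am looking for a sliding-window splitter for a string made up of words, with a window size of N.
Input: "I love food and I like drink", window size 3
Output: ["I love food", "love food and", "food and I", "and I like" .....]
All the sliding-window suggestions I have found work on the characters of a string, not on its words. Is there something that does this out of the box?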
def token_sliding_window(sentence, size):
    # split() with no argument splits on any run of whitespace
    tokens = sentence.split()
    for i in range(len(tokens) - size + 1):
        yield tokens[i:i + size]
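The question asks for each window as a single string rather than a list of tokens. A minimal sketch of that variation, assuming words are separated by whitespace; the name `sliding_window_strings` is hypothetical and does not come from any of the answers:

```python
def sliding_window_strings(sentence, size):
    """Yield each run of `size` consecutive words, joined back into a string."""
    tokens = sentence.split()  # split on any whitespace
    # range() is empty when the sentence has fewer than `size` words
    for i in range(len(tokens) - size + 1):
        yield " ".join(tokens[i:i + size])


windows = list(sliding_window_strings("I love food and I like drink", 3))
print(windows)
```

A sentence with fewer words than the window size simply produces no windows.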
You can use iterators with different offsets and zip them all together.
>>> arr = "I love food. blah blah".split()
>>> its = [iter(arr), iter(arr[1:]), iter(arr[2:])]  # Extend the pattern for longer windows
>>> list(zip(*its))  # list() is needed on Python 3, where zip is lazy
[('I', 'love', 'food.'), ('love', 'food.', 'blah'), ('food.', 'blah', 'blah')]
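If your sentence is long, you may want to use izip from itertools (on Python 3 the built-in zip is already lazy), or perhaps a plain old loop, as in the other answer.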
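For long inputs, one way to write the plain loop without slicing the word list several times is to keep the current window in a `collections.deque` with a fixed `maxlen`. This is only a sketch of that idea, not code from the answer; the function name `iter_windows` is made up for illustration:

```python
from collections import deque


def iter_windows(words, size):
    """Lazily yield tuples of `size` consecutive items from any iterable."""
    window = deque(maxlen=size)  # the oldest item drops out automatically
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)


result = list(iter_windows("I love food. blah blah".split(), 3))
print(result)
```

Because it only iterates once over `words`, this also works when the words come from a generator rather than a list.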
An approach based on subscripting (slicing) the split word sequence:
def split_on_window(sequence="I love food and I like drink", limit=4):
    results = []
    split_sequence = sequence.split()
    # Number of windows that fit; if it is zero or negative the loop does not run
    iteration_length = len(split_sequence) - (limit - 1)
    max_window_indices = range(iteration_length)
    for index in max_window_indices:
        results.append(split_sequence[index:index + limit])
    return results
Sample output:
>>> for window in split_on_window("I love food and I like drink", 3):
...     print(window)
...
['I', 'love', 'food']
['love', 'food', 'and']
['food', 'and', 'I']
['and', 'I', 'like']
['I', 'like', 'drink']
Here is an alternative answer inspired by @SuperSaiyan:
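try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def split_on_window(sequence, limit):
    split_sequence = sequence.split()
    # One iterator per offset; zipping them yields one tuple per window
    iterators = [iter(split_sequence[index:]) for index in range(limit)]
    return izip(*iterators)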
Sample output:
>>> s = "I love food and I like drink"
>>> list(split_on_window(s, 4))
[('I', 'love', 'food', 'and'), ('love', 'food', 'and', 'I'),
('food', 'and', 'I', 'like'), ('and', 'I', 'like', 'drink')]
Benchmark:
Sequence = I love food and I like drink, limit = 3
Repetitions = 1000000
Using subscripting -> 3.8326420784
Using izip -> 5.41380286217 # Modified to return a list for the benchmark.
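The script behind these numbers is not shown, so the following is only a sketch of how a comparable measurement could be made with the standard `timeit` module. The function names `windows_by_slicing` and `windows_by_zip` are hypothetical stand-ins for the two approaches, the repetition count is reduced, and timings will differ by machine and Python version:

```python
import timeit


def windows_by_slicing(sequence, limit):
    words = sequence.split()
    return [words[i:i + limit] for i in range(len(words) - limit + 1)]


def windows_by_zip(sequence, limit):
    words = sequence.split()
    # zip stops at the shortest offset iterator; list() forces evaluation
    return list(zip(*[iter(words[i:]) for i in range(limit)]))


sentence = "I love food and I like drink"
repetitions = 10000  # assumption: far fewer runs than the 1000000 quoted above

for func in (windows_by_slicing, windows_by_zip):
    seconds = timeit.timeit(lambda: func(sentence, 3), number=repetitions)
    print("%s -> %.4f s" % (func.__name__, seconds))
```

Both functions build a full list here so that the lazy variant is not credited for work it has not yet done, mirroring the note on the izip line.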