Implement sliding window on file lines in Python
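I am trying to implement a sliding/moving window approach over the lines of a csv file using Python. Each line has a column with a binary value, yes or no. Basically, I want to remove rare yes noise. That is, if a window of 5 lines (at most 5) contains 3 yes lines, keep them. But if it contains only 1 or 2, change them to no. How can I do that?
For example, the yes values below should all become no: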
...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
...
But in the following we keep them as they are (there is a window of 5 in which 3 are yes):
...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
...
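I tried to write something with a window of 5, but got stuck (it is not complete):
window_size = 5
# Raw strings keep the backslashes in the Windows paths literal.
basename = v3file.split('\\')[5]
filename = r'C:\Users\username\v3\And-' + basename
with open(filename) as fin:
    with open(r'C:\Users\username\v4\And2-' + basename, 'w') as finalout:
        line = fin.readline()
        index = 0
        sequence = []
        accs = []
        while line:
            print(line)
            for i in range(window_size):
                line = fin.readline()
                sequence.append(line)
            index = index + 1
            # Stuck here: seek() takes a byte offset, not a line number.
            fin.seek(index)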
Here is a 5-line solution based on building successive list comprehensions:
lines = [
'1,a1,b1,no,0.75',
'2,a2,b2,yes,0.45',
'3,a3,b3,yes,0.98',
'4,a4,b4,yes,0.22',
'5,a5,b5,no,0.46',
'6,a6,b6,no,0.98',
'7,a7,b7,yes,0.22',
'8,a8,b8,no,0.46',
'9,a9,b9,no,0.20']
n = len(lines)
# flag all lines containing 'yes' (pad with 2 empty lines at each boundary to avoid edge problems)
flags = [line.count('yes') for line in ['', ''] + lines + ['', '']]
# count the flags in the sliding window [p-2, p+2] centered on each line
counts = [sum(flags[p-2:p+3]) for p in range(2, n+2)]
# tag lines that need to be changed
tags = [flag > 0 and count < 3 for (flag, count) in zip(flags[2:], counts)]
# change tagged lines (use a separate loop variable so n is not overwritten)
for i in range(n):
    if tags[i]:
        lines[i] = lines[i].replace('yes', 'no')
print(lines)
Result:
['1,a1,b1,no,0.75',
'2,a2,b2,yes,0.45',
'3,a3,b3,yes,0.98',
'4,a4,b4,yes,0.22',
'5,a5,b5,no,0.46',
'6,a6,b6,no,0.98',
'7,a7,b7,no,0.22',
'8,a8,b8,no,0.46',
'9,a9,b9,no,0.20']
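Edit: when you read the data from a standard text file, all you have to do is:
with open(filename, 'r') as f:
    lines = f.read().strip().split('\n')
(strip() removes potential blank lines at the top or bottom of the file, and split('\n') turns the file content into a list of lines.) Then use the code above...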
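The code above flags a line whenever the substring 'yes' appears anywhere in it, which could misfire if another column happens to contain that text. As a rough sketch only, the same centered-window idea can be applied to parsed csv rows so that only the flag column is compared. The function name denoise_centered, its parameters, and the assumption that the flag sits in column index 3 (taken from the sample rows) are illustrative and not part of the original answer.

```python
import csv
import io

def denoise_centered(rows, col=3, window=5, min_yes=3):
    """Return a copy of rows where sparse 'yes' flags in column `col` become 'no'.

    Hypothetical helper: a 'yes' is kept only if the window centered on its
    row holds at least `min_yes` 'yes' flags.
    """
    half = window // 2
    flags = [row[col] == 'yes' for row in rows]
    result = []
    for i, row in enumerate(rows):
        # Slicing clips at the list boundaries, so no padding is needed.
        count = sum(flags[max(0, i - half):i + half + 1])
        new_row = list(row)
        if flags[i] and count < min_yes:
            new_row[col] = 'no'
        result.append(new_row)
    return result

# Same nine sample lines as in the answer above.
sample = """1,a1,b1,no,0.75
2,a2,b2,yes,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.98
7,a7,b7,yes,0.22
8,a8,b8,no,0.46
9,a9,b9,no,0.20
"""
rows = list(csv.reader(io.StringIO(sample)))
cleaned = denoise_centered(rows)
print([row[3] for row in cleaned])
# ['no', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no']
```

The cleaned rows could then be written back with csv.writer. Note that a centered window is a stricter rule than "some window of 5 contains this line and at least 3 yes values", so the two readings of the question can disagree on individual lines.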
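You can use collections.deque with the maxlen argument set to the desired window size to implement a sliding window that tracks the yes/no flags of the most recent 5 lines. Keep a running count of yeses rather than summing the yeses in the sliding window on every iteration, which is more efficient. When you have a full-size sliding window and the count of yeses is greater than 2, add the line indices of those yeses to a set of lines whose yeses should be kept as-is. Then, in a second pass after resetting the file pointer of the input, change yeses to noes if the line index is not in the set: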
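from collections import deque
window_size = 5
with open(filename) as fin, open(output_filename, 'w') as finalout:
    yeses = 0  # running count of yes flags currently in the window
    window = deque(maxlen=window_size)
    preserved = set()  # indices of lines whose yes must be kept
    # first pass: find the yes lines that belong to a dense enough window
    for index, line in enumerate(fin):
        window.append('yes' in line)
        if window[-1]:
            yeses += 1
        if len(window) == window_size:
            if yeses > 2:
                preserved.update(i for i, f in enumerate(window, index - window_size + 1) if f)
            # the oldest flag is evicted by the next append, so drop it from the count
            if window[0]:
                yeses -= 1
    # second pass: rewrite every line that is not preserved
    fin.seek(0)
    for index, line in enumerate(fin):
        if index not in preserved:
            line = line.replace('yes', 'no')
        finalout.write(line)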
Demo: https://repl.it/@blhsing/StripedCleanCopyrightinfringement
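To try the deque approach without touching any files, here is a minimal in-memory sketch of the same two-pass idea. The function names preserved_indices and denoise and their parameters are hypothetical, chosen for illustration. It is checked against the two examples from the question.

```python
from collections import deque

def preserved_indices(flags, window_size=5, min_yes=3):
    """Return indices of True flags that lie in some full window with >= min_yes Trues."""
    yeses = 0
    window = deque(maxlen=window_size)
    preserved = set()
    for index, flag in enumerate(flags):
        # The oldest flag is evicted by the append below, so uncount it first.
        if len(window) == window_size and window[0]:
            yeses -= 1
        window.append(flag)
        if flag:
            yeses += 1
        if len(window) == window_size and yeses >= min_yes:
            start = index - window_size + 1
            preserved.update(i for i, f in enumerate(window, start) if f)
    return preserved

def denoise(lines, window_size=5, min_yes=3):
    """Second pass: change 'yes' to 'no' on every line that is not preserved."""
    keep = preserved_indices(['yes' in line for line in lines], window_size, min_yes)
    return [line if i in keep else line.replace('yes', 'no')
            for i, line in enumerate(lines)]

sparse = ['1,a1,b1,no,0.75', '2,a2,b2,no,0.45', '3,a3,b3,yes,0.98',
          '4,a4,b4,yes,0.22', '5,a5,b5,no,0.46', '6,a6,b6,no,0.20']
dense = sparse[:5] + ['6,a6,b6,yes,0.20']

print(denoise(sparse))           # both yes lines become no
print(denoise(dense) == dense)   # True: lines 3, 4 and 6 share one window of 5
```

As in the file-based version, an input shorter than the window never produces a full window, so every yes in it is changed. This rule also differs from the centered-window answer: on that answer's nine sample lines it keeps line 7, because lines 3, 4 and 7 fall inside one window of 5.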