Cannot Iterate over cStringIO
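In my script I am writing lines to a file, but some of the lines may be duplicates. So I created a temporary cStringIO file-like object, which I call my "intermediate file". I write the lines to the intermediate file first, remove the duplicates, and then write to the real file.
So I wrote a simple for loop to iterate over every line in the intermediate file and remove any duplicates.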
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    cStringIO.OutputType.getvalue(f_temp)  # From:
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
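My problem is that the for loop never executes. I can verify this by putting a breakpoint in my debugger; that line of code is skipped and the function exits. I even read another answer and inserted the line cStringIO.OutputType.getvalue(f_temp), but that did not solve my problem.
I don't know why I cannot read and iterate over my file-like object.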
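The answer you referenced is a little incomplete. It tells you how to get the cStringIO buffer as a string, but then you have to do something with that string. You can do that like this: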
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # contents = cStringIO.OutputType.getvalue(f_temp)  # From:
    contents = f_temp.getvalue()  # simpler approach
    contents = contents.strip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')  # convert to iterable
    for line in lines:  # Iterate through the list of lines.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')
            lines_seen.add(line)
    f_out.close()
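As an aside on the splitting step above: `splitlines` can stand in for the `strip`/`split` pair, and passing `True` keeps each line's newline so the line can be written back unchanged. This is only an illustrative sketch. It uses Python 3's `io.StringIO` as a stand-in for `cStringIO.StringIO`, which is an assumption and not part of the original code.

```python
import io

buf = io.StringIO()  # assumption: Python 3 stand-in for cStringIO.StringIO
buf.write('string 1\n')
buf.write('string 2\n')

# getvalue() returns the whole buffer, wherever the read/write position is.
contents = buf.getvalue()

# splitlines() drops the line endings and yields no trailing empty item.
lines = contents.splitlines()

# Passing True keeps each line's newline attached.
lines_with_newlines = contents.splitlines(True)
```

With the newlines kept, the loop could write `line` directly instead of `line + '\n'`.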
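But it's probably better to just use normal IO operations on the f_temp "file handle". After the writes, its position is at the end of the buffer, so seek back to the start before reading, like this: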
def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.
    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)
    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
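To see why the original loop body never ran, it may help to look at the buffer's position in isolation. The snippet below is a minimal sketch, again assuming Python 3's `io.StringIO` in place of `cStringIO.StringIO`. The variable names are made up for illustration.

```python
import io

buf = io.StringIO()  # assumption: Python 3 stand-in for cStringIO.StringIO
buf.write('string 1\n')
buf.write('string 2\n')

# After writing, the position sits at the end of the buffer,
# so iterating from here yields nothing.
end_position = buf.tell()
lines_before_seek = list(buf)

# Rewinding to the start lets the same iteration see every line.
buf.seek(0)
lines_after_seek = list(buf)
```

Here `lines_before_seek` ends up empty, while `lines_after_seek` holds both lines with their newlines.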
Here's a test (with either one):
import cStringIO, os

def define_outputs(dir_out):
    return open('/tmp/test.txt', 'w')

def compute_md5(line):
    return line

f = cStringIO.StringIO()
f.write('string 1\n')
f.write('string 2\n')
f.write('string 1\n')
f.write('string 2\n')
f.write('string 3\n')
remove_duplicates(f, 'tmp')
with open('/tmp/test.txt', 'r') as f:
    print(str([row for row in f]))
# ['string 1\n', 'string 2\n', 'string 3\n']
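One thing the test does not exercise: in the functions above, `line` is overwritten with the return value of `compute_md5`, so with a real hash function it is the hash that would be written to the output, not the line. The test hides this because its `compute_md5` returns the line unchanged. Since the real `compute_md5` is not shown, the following is only a guess at the intent, written for Python 3 with `io.StringIO` and `hashlib`. The names `dedupe_lines` and `seen_digests` are hypothetical.

```python
import hashlib
import io


def dedupe_lines(source, sink):
    """Copy unique lines from source to sink, remembering only their MD5 digests."""
    seen_digests = set()
    source.seek(0)  # rewind so the buffer is read from the start
    for line in source:
        digest = hashlib.md5(line.encode('utf-8')).digest()
        if digest not in seen_digests:
            sink.write(line)  # write the original line, not its hash
            seen_digests.add(digest)


source = io.StringIO()
for text in ('string 1\n', 'string 2\n', 'string 1\n', 'string 3\n'):
    source.write(text)

sink = io.StringIO()  # stands in for the real output file
dedupe_lines(source, sink)
result = sink.getvalue()
```

Keeping the digest in its own variable leaves `line` intact for writing, which is the only change from the loop in the answer.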