How can I skip files that have nothing written in them when using os.listdir?
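I am trying to go through all the files in a directory and build another text file that will eventually let me plot my data. Some of the VariantLine# files contain no information, because those variants were not found in any of my strains. When my for loop runs, it reports that my list index is out of range, and this happens on the files that have nothing written in them. I have over 10,000 VariantLine# files, so I don't want to check each one by hand and delete the empty ones. I only want to parse the files that actually have content, since those hold the information I need for my plots. So far, everything I have found only covers skipping lines with no information, not whole files.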
for files in os.listdir("/nobackup/rogers_research/tmiorin/DsantRNAproject"):
    if re.search("^VariantLine", files):
        filename=files
        filenumber=filename[11:]
        print filenumber
        for line in filename:
            stuff=line.split()
            strain=stuff[0]
            chrom=stuff[1]
            posone=stuff[2]
            postwo=stuff[3]
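Essentially, I need a way to parse only the files that have something written in them. Ideally I would put a line of code before "for line in filename" that reads the file and continues with the for loop only if the file actually has content. I can't seem to find anything about this online, so if anyone happens to know what I could put there, I would appreciate it. Thanks!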
for files in os.listdir("/nobackup/rogers_research/tmiorin/DsantRNAproject"):
    if re.search("^VariantLine", files):
        filename=files
        filenumber=filename[11:]
        print filenumber
        for line in filename:
            if (not line==""):
                stuff=line.split()
                strain=stuff[0]
                chrom=stuff[1]
                posone=stuff[2]
                postwo=stuff[3]
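if (not line==""): checks that the line is not empty. If that does not work, you can also check that the line is not equal to "\n", since a blank line read from a file still carries its newline character.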
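To illustrate the blank-line check described above, here is a minimal sketch. It assumes Python 3 and that the lines come from an opened file handle (or any iterable of strings) rather than from the file name itself. `parse_lines` is a hypothetical helper name, and silently skipping lines with fewer than four fields is an assumption, not something the question specifies.

```python
def parse_lines(lines):
    """Return the first four whitespace-separated fields of each usable line."""
    records = []
    for line in lines:
        # A blank line read from a file is "\n", not "", so strip before testing.
        if not line.strip():
            continue
        fields = line.split()
        # Assumption: lines with fewer than four fields are skipped, not reported.
        if len(fields) < 4:
            continue
        records.append(tuple(fields[:4]))
    return records
```

Stripping the line before testing it covers both "" and "\n", as well as lines that contain only spaces or tabs.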
When iterating over the files, check their size first and only process files whose size is > 0:
if os.stat(filename).st_size > 0:
    <work>
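The size check above could be wired into the directory walk roughly as follows. This is a sketch assuming Python 3; `nonempty_variant_files` is a hypothetical helper name, and the prefix default is taken from the question. Note that `os.listdir` returns bare names, so each name is joined with the directory before calling `os.stat`.

```python
import os


def nonempty_variant_files(directory, prefix="VariantLine"):
    """Yield full paths of regular files in `directory` that start with
    `prefix` and have a size greater than zero bytes."""
    for name in sorted(os.listdir(directory)):
        # os.listdir gives names only; build the full path before stat-ing.
        path = os.path.join(directory, name)
        if (name.startswith(prefix)
                and os.path.isfile(path)
                and os.stat(path).st_size > 0):
            yield path
```

A caller would then loop with `for path in nonempty_variant_files("/nobackup/rogers_research/tmiorin/DsantRNAproject"):` and open each `path` inside the loop.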
This code should check the length of the split result, and it can also be optimized by not opening empty files in the first place, as follows:
import os

DATA_FILE_PREFIX = 'VariantLine'
# We expect each line of the file to contain 4 records and we will separate them
# with a split operation.
# Split, by default, splits on whitespace. Therefore, each file should
# contain a minimum of (4 data bytes + 3 delimiting bytes) = 7 total bytes
MIN_DATA_FILE_BYTE_SIZE = 7
# Get contents from directory as os.DirEntry objects (requires Python 3.5+)
dir_contents = os.scandir("/nobackup/rogers_research/tmiorin/DsantRNAproject")
# Filter directory contents to ensure that we only look at FILES, whose names
# match our known file prefix, and whose size in bytes is at least the minimum.
data_files_in_dir = [
    file_result
    for file_result in dir_contents
    if (file_result.is_file()
        and file_result.name.startswith(DATA_FILE_PREFIX)
        and file_result.stat().st_size >= MIN_DATA_FILE_BYTE_SIZE)
]
# Just calling this out explicitly so we can avoid calling len() each iteration
LEN_OF_FILE_PREFIX = len(DATA_FILE_PREFIX)
# Open all data files and read them.
for file_result in data_files_in_dir:
    file_name = file_result.name
    file_number = file_name[LEN_OF_FILE_PREFIX:]
    # Open by full path; the bare name only works if the data directory is
    # the current working directory.
    with open(file_result.path, 'r') as data_file_handle:
        for line in data_file_handle:
            stuff = line.split()
            # You might want to modify this condition to be 'length == 4'
            # Not sure how much you value your data quality, but in some
            # circumstances, I might be alarmed if I had more than 4 records in
            # a given line, as that might indicate data corruption and/or an
            # error in the collection method.
            if len(stuff) >= 4:
                strain = stuff[0]
                chrom = stuff[1]
                posone = stuff[2]
                postwo = stuff[3]
                # Placeholder: replace with your own processing function.
                do_something_with_data(strain, chrom, posone, postwo)
    # End open(file)
# End of for-loop over directory results
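This approach uses preconditions to avoid opening empty files and files that logically cannot contain enough useful data, which reduces I/O. It also adds a check after splitting each line to make sure the line contains at least four fields. Finally, it drops the regex match on the file name, which is unnecessary overhead when the code only needs to confirm that the name starts with a fixed character sequence.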
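If the stricter 'length == 4' condition mentioned in the code comments is preferred, one possible shape is sketched below. It assumes Python 3; `read_records` and its `expected_fields` parameter are hypothetical names, and collecting the line numbers of malformed lines (rather than raising an error) is an assumption about how such lines should be handled.

```python
def read_records(path, expected_fields=4):
    """Return (records, bad_line_numbers) for one data file.

    records holds one tuple per line with exactly `expected_fields` fields;
    bad_line_numbers holds the 1-based numbers of non-blank lines that had
    any other field count.
    """
    records, bad_line_numbers = [], []
    with open(path, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # blank line: nothing to parse, nothing to report
            if len(fields) == expected_fields:
                records.append(tuple(fields))
            else:
                bad_line_numbers.append(line_number)
    return records, bad_line_numbers
```

Returning the bad line numbers keeps the parse going across 10,000 files while still leaving a trail for any line that looks corrupted.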