How can I skip a file that has nothing written in it when using os.listdir?

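I'm trying to go through all the files in a directory and build another text file that will eventually let me plot my data. Some of the VariantLine# files contain no information, because those variants weren't found in any of my strains. When my for loop runs, it reports that my list index is out of range, and this happens on the files that have nothing written in them. I have over 10,000 VariantLine# files, so I don't want to check each one by hand and delete the empty ones. I only want to parse the files that actually have information written in them, since those give me what I need to make the plot. The only information I've found so far deals with skipping lines that have no information, not entire files.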

for files in os.listdir("/nobackup/rogers_research/tmiorin/DsantRNAproject"):
    if re.search("^VariantLine", files):
        filename=files
        filenumber=filename[11:]
        print filenumber
        for line in filename:
            stuff=line.split()
            strain=stuff[0]
            chrom=stuff[1]
            posone=stuff[2]
            postwo=stuff[3]

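Essentially, my problem is that I need a way to parse only the files that have something written in them. Ideally I'd put a line of code before "for line in filename" that reads the file and continues into the for loop only if the file actually has content. I can't seem to find anything about this online, so if anyone happens to know what I could put there, I'd greatly appreciate it. Thanks!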

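import os
import re

directory = "/nobackup/rogers_research/tmiorin/DsantRNAproject"
for files in os.listdir(directory):
    if re.search("^VariantLine", files):
        filename=files
        filenumber=filename[11:]
        print(filenumber)
        # Open the file; looping over the name itself would visit its characters
        with open(os.path.join(directory, filename)) as handle:
            for line in handle:
                if line.strip():
                    stuff=line.split()
                    strain=stuff[0]
                    chrom=stuff[1]
                    posone=stuff[2]
                    postwo=stuff[3]

if line.strip(): checks that the line is not empty. Lines read from a file keep their trailing newline, so a blank line arrives as "\n" rather than ""; stripping whitespace first covers both cases, which a plain comparison against "" would miss.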
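As a small illustration of the blank-line check, here is a minimal sketch of a per-line guard. The helper name parse_fields and the sample line are hypothetical; only the four-field layout comes from the question. It returns None for blank or short lines instead of raising IndexError:

```python
def parse_fields(line, expected=4):
    """Return the first `expected` whitespace-separated fields, or None.

    Hypothetical helper: blank lines and lines with too few fields are
    reported as None rather than triggering an IndexError.
    """
    # split() with no arguments ignores leading/trailing whitespace,
    # so "" and "\n" both produce an empty list.
    stuff = line.split()
    if len(stuff) < expected:
        return None
    return tuple(stuff[:expected])


# Made-up sample input, not taken from the original data.
print(parse_fields("\n") is None)
print(parse_fields("strainA chr2L 100 200\n"))
```

Callers can then write "fields = parse_fields(line)" and skip the line when the result is None.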

When iterating over the files, check their size first and only process those whose size is > 0:

if os.stat(filename).st_size > 0:
    <work>
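One way the size check might be combined with os.listdir is sketched below. The function name non_empty_files and its prefix parameter are assumptions made for illustration. Note that os.listdir returns bare names, so the sketch joins each name with the directory before asking for its size:

```python
import os


def non_empty_files(directory, prefix="VariantLine"):
    """Return sorted paths of regular files in `directory` whose names start
    with `prefix` and that contain at least one byte."""
    paths = []
    for name in os.listdir(directory):
        # os.listdir yields names only; build the full path before stat-ing.
        path = os.path.join(directory, name)
        if (name.startswith(prefix)
                and os.path.isfile(path)
                and os.path.getsize(path) > 0):
            paths.append(path)
    return sorted(paths)
```

The returned paths can be passed straight to open(), regardless of the current working directory.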

The code should not only check the length of the split result; it can also avoid opening empty files in the first place, like this:

import os

DATA_FILE_PREFIX = 'VariantLine'
# We expect each line of the file to contain 4 records and we will separate them
# with a split operation.
# Split, by default, tries to split on whitespace. Therefore, each file should
# contain a minimum of (4 data bytes + 3 delimiting bytes) = 7 total bytes
MIN_DATA_FILE_BYTE_SIZE = 7

# Get contents from directory as os.DirEntry objects
dir_contents = os.scandir("/nobackup/rogers_research/tmiorin/DsantRNAproject")

# Filter directory contents to ensure that we only look at FILES, whose names
# match our known file prefix, and whose size in bytes is greater than min.
data_files_in_dir = [
    file_result
    for file_result in dir_contents
    if (file_result.is_file()
        and file_result.name.startswith(DATA_FILE_PREFIX)
        and file_result.stat().st_size >= MIN_DATA_FILE_BYTE_SIZE)
]

# Just calling this out explicitly so we can avoid calling len() each iteration
LEN_OF_FILE_PREFIX = len(DATA_FILE_PREFIX)
# Open all data files and read them.
for file_result in data_files_in_dir:
    file_name = file_result.name
    file_number = file_name[LEN_OF_FILE_PREFIX:]
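    # Use the entry's full path; the bare name only resolves if the current
    # working directory happens to be the data directory.
    with open(file_result.path, 'r') as data_file_handle: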
        for line in data_file_handle:
            stuff=line.split()
            # You might want to modify this condition to be 'length == 4'
            # Not sure how much you value your data quality, but in some
            # circumstances, I might be alarmed if I had more than 4 records in
            # a given line, as that might indicate data corruption and/or an
            # error in the collection method.
            if len(stuff) >= 4:
                strain=stuff[0]
                chrom=stuff[1]
                posone=stuff[2]
                postwo=stuff[3]
                do_something_with_data(strain, chrom, posone, postwo)
    # End open(file)
# End of for-loop over directory results

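This approach uses preconditions to avoid opening empty files, or files that logically cannot contain enough useful data, which saves I/O. It also adds a check after splitting each line to ensure the line contains at least four fields. Finally, this solution drops the RegEx match on the file name, which is needlessly expensive when the code only needs to confirm that the name starts with a given character sequence.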
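To make the last point concrete, the sketch below shows that a plain prefix test selects the same names as the anchored regular expression from the question, and that the file number can be sliced off using the prefix length instead of a hard-coded 11. The helper file_number and the sample names are hypothetical:

```python
import re

PREFIX = "VariantLine"


def file_number(name, prefix=PREFIX):
    """Return the part of `name` after `prefix`, or None if the prefix is absent."""
    if not name.startswith(prefix):
        return None
    return name[len(prefix):]


# Illustrative names only; they are not taken from the original directory.
names = ["VariantLine42", "notes.txt", "VariantLine7"]
by_prefix = [n for n in names if n.startswith(PREFIX)]
by_regex = [n for n in names if re.search("^VariantLine", n)]
```

Both list comprehensions keep the same two names here; the startswith version simply avoids going through the regex engine.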