How to search for a string in a folder of text files using Python

Question

I am writing some scripts in Python to process text files. Locally, the script reads from a single txt file, so I use

index_file = open('index.txt', 'r')
for line in index_file:
    ....

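and loop through the file to find a matching string. When running on Amazon EMR, however, index.txt itself is split into multiple txt files inside a single folder.

So I would like to replicate that locally and search for a given string across multiple txt files, but I am having trouble finding clean code to do it.

What is the best way to do this with the least amount of code?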

Answer 1

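import os
from glob import glob

def readindex(path):
    # Match every .txt file directly inside the given directory.
    pattern = '*.txt'
    full_path = os.path.join(path, pattern)
    # glob returns paths in arbitrary order, so sort for a stable sequence.
    for fname in sorted(glob(full_path)):
        # The with block closes each file once its lines are exhausted.
        with open(fname, 'r') as f:
            for line in f:
                yield line

# Read the lines into a list so they can be iterated more than once.
linelist = list(readindex("directory"))
for line in linelist:
    print(line, end='')

This script defines a generator (see this question for details about generators) that iterates, in sorted order, over all files with the extension "txt" in the directory "directory". It yields all lines as a single stream, which can be iterated over after calling the function as if the lines came from one open file, since that appears to be what the asker wants. Passing end='' to print ensures the newline is not printed twice (in Python 2, a trailing comma after the print statement serves the same purpose), although the body of the for loop will be replaced by the asker anyway. In that case, line.rstrip() can be used to get rid of the newline; note that it also strips any other trailing whitespace.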
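Since the question is ultimately about finding a string, here is a minimal sketch of how the same sorted directory walk could report matching lines. The function name find_matches, its parameters, and the part-0000x.txt file names in the usage note are hypothetical, chosen only for illustration; the match is a plain substring test, which may or may not be what the asker needs.

```python
import os
from glob import glob


def find_matches(directory, needle, pattern='*.txt'):
    """Yield (file name, line number, line) for each line containing needle.

    Hypothetical helper: files are visited in sorted name order and
    matching is a simple case-sensitive substring test.
    """
    for fname in sorted(glob(os.path.join(directory, pattern))):
        with open(fname, 'r') as handle:
            # Line numbers start at 1 and restart for every file.
            for lineno, line in enumerate(handle, start=1):
                if needle in line:
                    yield os.path.basename(fname), lineno, line.rstrip('\n')
```

Calling list(find_matches("directory", "some string")) would collect every hit, for example from files named part-00000.txt and part-00001.txt, while a plain for loop over the generator processes hits one at a time without loading all files into memory.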

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.
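Because glob makes no promise about ordering, sorting the paths before reading keeps the combined line stream reproducible. As an alternative sketch, the standard-library fileinput module can chain the sorted files into one stream. The function name read_parts is hypothetical, and the early return is an assumption about the desired behavior for an empty folder.

```python
import fileinput
import os
from glob import glob


def read_parts(directory, pattern='*.txt'):
    """Yield every line of the matching files, in sorted file-name order."""
    paths = sorted(glob(os.path.join(directory, pattern)))
    if not paths:
        # fileinput reads sys.stdin when given an empty list, so stop early.
        return
    # fileinput opens the files one after another and closes them on exit.
    with fileinput.input(files=paths) as stream:
        for line in stream:
            yield line
```

Inside the loop, fileinput.filename() and fileinput.filelineno() report which file and line the current line came from, which can be handy when a match has to be traced back to its source.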

python

emr