How to read multiple text files, where only the files of the same group are read together?

Question

I have several text files like these in a directory:

id-2020-01-21-22.txt
id-2020-01-21-23.txt
id-2020-01-22-00.txt
id-2020-01-22-01.txt
id-2020-01-22-02.txt
id-2020-01-23-00.txt
id-2020-01-24-00.txt

How can I read id-2020-01-21-22.txt and id-2020-01-21-23.txt together, make them into one data frame and write it to a combined text file, then do the same for id-2020-01-22-00.txt, id-2020-01-22-01.txt and id-2020-01-22-02.txt, and so on until the last file in the directory?

All of the text files have the following internal structure:

100232323\n
903812398\n
284934289\n
{empty line placeholder}

There is no header, but every text file ends with an empty line. I am new to Python, and I would really appreciate your help.

Here is my progress so far:

import os

new_list = []
for root, dirs, files in os.walk('./textFilesFolder'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
                new_list.append(text)


print(new_list)

Answer 1

You want daily summaries that concatenate the hourly files together. OK, good.

Create a regex for the Y-m-d date:

import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')
prev_date = None
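To make the pattern's behaviour concrete, here is a small sketch that applies it to a few file names. The helper name `extract_date` and the non-matching name `notes.txt` are made up for illustration and are not part of the question.

```python
import re

# Same pattern as above: capture the YYYY-MM-DD part and ignore the hour.
date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def extract_date(filename):
    """Return the 'YYYY-MM-DD' part of an hourly file name, or None if it does not match."""
    m = date_re.search(filename)
    return m.group(1) if m else None


# 'notes.txt' is a hypothetical unrelated file that the pattern should reject.
for name in ['id-2020-01-21-22.txt', 'id-2020-01-22-00.txt', 'notes.txt']:
    print(name, '->', extract_date(name))
```

Files that share the same captured date belong to the same daily group.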

Now, in your loop, you can replace the existing if with:

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            ...
            prev_date = date

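Having parsed out the date, you can now notice when it changes, perhaps by comparing whether prev_date == date, and take appropriate action, such as writing a new file.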
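One possible way to act on the date change is sketched below. It assumes the file names are sorted first, so that files from the same day sit next to each other; the function name `group_by_day` is hypothetical.

```python
import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def group_by_day(filenames):
    """Split file names into per-day lists by watching for the date to change."""
    groups = []        # list of (date, [file names]) in chronological order
    prev_date = None
    # Zero-padded names sort chronologically, so same-day files become adjacent.
    for name in sorted(filenames):
        m = date_re.search(name)
        if not m:
            continue   # skip anything that is not an hourly file
        date = m.group(1)
        if date != prev_date:
            # The day changed: start a new group.
            groups.append((date, []))
            prev_date = date
        groups[-1][1].append(name)
    return groups


names = ['id-2020-01-22-00.txt', 'id-2020-01-21-23.txt', 'id-2020-01-21-22.txt']
for date, members in group_by_day(names):
    print(date, members)
```

Sorting matters here: neither os.walk() nor os.listdir() promises any particular order, and without adjacency the same day could be split across several groups.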

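Or consider using with open(f'output-{date}.txt', 'a') as fout:, which lets you append to a (possibly already existing) file. That way the file system remembers things for you, and you don't need to track more variables in your program.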

By the way, your use of walk() is very nice, kudos for that. But for this directory the structure is simple enough that you could use glob:

import glob

new_list = []
for file in glob.glob('id-*.txt'):
    ...

Edit

Assume we start from a clean slate, with no output files:

$ rm output-*.txt

Then we can append in a loop, similar to $ cat hour01 hour02 > day31. Or, equivalently, similar to $ rm day31; cat hour01 >> day31; cat hour02 >> day31.

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            with open(file) as fin:
                with open(f'output-{date}.txt', 'a') as fout:
                    fout.write(fin.read())

And that's it, we're done! We read each hourly text and write it to the end of the daily file.

I mentioned rm above because if you are debugging and run this twice, or N times, you will end up with an output file N times larger than you expected.
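If the rm step is easy to forget, one alternative (a sketch, not the approach shown above) is to collect the file names per day first and then open each daily file once in 'w' mode, which truncates it on every run. The function name `combine_daily` and the separate output directory are assumptions made for this example.

```python
import os
import re
from collections import defaultdict

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def combine_daily(src_dir, out_dir):
    """Write one output-<date>.txt per day into out_dir and return the paths written."""
    by_day = defaultdict(list)
    # Sorted names keep the hours of each day in chronological order.
    for name in sorted(os.listdir(src_dir)):
        m = date_re.search(name)
        if m:
            by_day[m.group(1)].append(name)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for date, names in by_day.items():
        out_path = os.path.join(out_dir, f'output-{date}.txt')
        # 'w' truncates an existing file, so reruns do not pile up duplicates.
        with open(out_path, 'w') as fout:
            for name in names:
                with open(os.path.join(src_dir, name)) as fin:
                    fout.write(fin.read())
        written.append(out_path)
    return written
```

It could be called as combine_daily('./textFilesFolder', './combined'), where './combined' is a hypothetical output folder. Running it repeatedly yields the same daily files each time.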

Answer 2

For readability, you could also try it this way.

from collections import defaultdict
import os
import pandas as pd

data = defaultdict(list)
for i in (os.listdir('files/')): # here files is a folder in current directory.
    print(i)                     # which has your text files.
    column = i.split('-')[3]
    with open('files/'+i, 'r') as f:
        file_data = f.read().replace('\n', ' ').split(' ')
        data[column].extend(file_data[:-1])
df = pd.DataFrame(data)
print('---')
print(df)

Output:

id-2020-01-22-01.txt
id-2020-01-22-00.txt
id-2020-01-21-23.txt
id-2020-01-21-22.txt
---
          22          21
0    1006523  1002323212
1   90381122  9038123912
2   28493423   284934212
3  100232323   100232323
4  903812332   903812392
5  284934212   284934289
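One caveat: pd.DataFrame(data) raises a ValueError when the lists in the dict have different lengths, which would happen if one day has more hourly files or more rows than another. A possible workaround, sketched here with made-up sample values and a hypothetical helper name `columns_to_frame`, is to wrap each list in a pd.Series so that shorter columns are padded with NaN.

```python
import pandas as pd


def columns_to_frame(data):
    """Build a DataFrame from columns of unequal length, padding short ones with NaN."""
    return pd.DataFrame({key: pd.Series(values) for key, values in data.items()})


# Sample values for illustration only: day '21' has fewer rows than day '22'.
data = {
    '21': ['100232323', '903812398'],
    '22': ['284934289', '100232323', '903812398'],
}
df = columns_to_frame(data)
print(df)
```

Whether one column per day is the right layout depends on what you do next; if each day goes to its own file anyway, a separate DataFrame per day may be simpler.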

python

text-files

python-3.x