Read content of several text files into a pandas DataFrame

In my directory Geolife, I have several text files named in the following format:

Geolife$ ls
labels106.txt  labels153.txt  labels73.txt
labels107.txt  labels154.txt  labels75.txt
labels108.txt  labels161.txt  labels76.txt
labels10.txt   labels163.txt  labels78.txt
labels110.txt  labels167.txt  labels80.txt
labels111.txt  labels170.txt  labels81.txt
...

Each of these files contains tab-separated data, for example:

Geolife$ cat labels10.txt
Start Time  End Time    Transportation Mode
2007/06/26 11:32:29 2007/06/26 11:40:29 bus
2008/03/28 14:52:54 2008/03/28 15:59:59 train
2008/03/28 16:00:00 2008/03/28 22:02:00 train
2008/03/29 01:27:50 2008/03/29 15:59:59 train
2008/03/29 16:00:00 2008/03/30 15:59:59 train
2008/03/30 16:00:00 2008/03/31 03:13:11 train
2008/03/31 04:17:59 2008/03/31 15:31:06 train
2008/03/31 16:00:08 2008/03/31 16:09:01 taxi
2008/03/31 17:26:04 2008/04/01 00:35:26 train
2008/04/01 00:48:32 2008/04/01 00:59:23 taxi
...

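I would like to read this data into a pandas DataFrame (taking the date from the first column of each file) and add a column that tracks which file number each row came from. I am not interested in the time part of the timestamps, only the date, so that I can analyze the data by year, day, and so on.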

For the file above, the expected output would be:

User-ID  Date   Mode
10  2007-06-26  bus
10  2008-03-28  train
10  2008-03-28  train
10  2008-03-29  train
10  2008-03-29  train
10  2008-03-30  train
10  2008-03-31  train
10  2008-03-31  taxi
10  2008-03-31  train
10  2008-04-01  taxi
...
# and contents of all other files, e.g. labels106.txt
106 2007-10-07 car
106 2007-10-08 car
106 2007-10-09 car
.... 

How can I do this?

Edit

labels106.txt (like all the other files) contains data in the same format.

Geolife$ cat labels106.txt
Start Time  End Time    Transportation Mode
2007/10/07 16:00:00 2007/10/08 15:59:59 car
2007/10/08 16:00:00 2007/10/09 15:59:59 car
2007/10/09 16:00:00 2007/10/10 15:59:59 car

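This is not exactly what you asked for, but this solution reads the .txt files and writes the data to a .csv file, which you can then read with the pandas.read_csv(..) method.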

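import os

files_dir = 'your-geolife-dir'

# Open the output once in write mode so that re-running does not append duplicate rows.
with open(os.path.join(files_dir, 'data.csv'), 'w') as out:
    for root, dirs, files in os.walk(files_dir):
        # sorted() makes the row order reproducible; os.walk gives no ordering guarantee
        for file in sorted(files):
            if file.startswith('labels') and file.endswith('.txt'):
                # 'labels10.txt' -> '10' (str.strip('.txt') strips characters, not a suffix)
                user = os.path.splitext(file)[0][len('labels'):]

                with open(os.path.join(root, file), 'r') as f:
                    for line in f:
                        line = line.rstrip()
                        line = line.replace('\t', ',')
                        line = line.replace('/', '-')
                        # skip blank lines and the header row ("Start Time ...")
                        if line and not line.startswith('S'):
                            out.write(f'{user},{line}\n')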

Output:

$ cat data.csv
10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus
10,2008-03-28 14:52:54,2008-03-28 15:59:59,train
10,2008-03-28 16:00:00,2008-03-28 22:02:00,train
10,2008-03-29 01:27:50,2008-03-29 15:59:59,train
10,2008-03-29 16:00:00,2008-03-30 15:59:59,train
10,2008-03-30 16:00:00,2008-03-31 03:13:11,train
10,2008-03-31 04:17:59,2008-03-31 15:31:06,train
10,2008-03-31 16:00:08,2008-03-31 16:09:01,taxi
10,2008-03-31 17:26:04,2008-04-01 00:35:26,train
10,2008-04-01 00:48:32,2008-04-01 00:59:23,taxi
106,2007-10-07 16:00:00,2007-10-08 15:59:59,car
106,2007-10-08 16:00:00,2007-10-09 15:59:59,car
106,2007-10-09 16:00:00,2007-10-10 15:59:59,car

You can customize the solution as needed (for example, you can write a CSV header row).
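
To get from data.csv to the DataFrame described in the question, something like the following might work. It is only a sketch: the function name load_labels is hypothetical, the column names are taken from the expected output in the question, and the two sample rows stand in for the real file. It assumes data.csv has no header row, as in the output shown above.

```python
import io

import pandas as pd


def load_labels(csv_source):
    """Read the header-less data.csv and return User-ID, Date and Mode columns."""
    df = pd.read_csv(
        csv_source,
        header=None,  # data.csv was written without a header row
        names=["User-ID", "Start Time", "End Time", "Mode"],
    )
    # Parse the start timestamp, then drop the time of day (set it to midnight)
    start = pd.to_datetime(df["Start Time"], format="%Y-%m-%d %H:%M:%S")
    df["Date"] = start.dt.normalize()
    return df[["User-ID", "Date", "Mode"]]


# Two sample rows in the same layout as data.csv; pass a file path instead for real data
sample = io.StringIO(
    "10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus\n"
    "106,2007-10-07 16:00:00,2007-10-08 15:59:59,car\n"
)
df = load_labels(sample)
print(df)
```

Because Date stays a datetime column, expressions such as df["Date"].dt.year can be used for grouping by year. For the real data, call load_labels with the path to data.csv instead of the in-memory sample.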