Read content of several text files into pandas DataFrame
In my directory Geolife, I have several text files named in the following format:
Geolife$ ls
labels106.txt labels153.txt labels73.txt
labels107.txt labels154.txt labels75.txt
labels108.txt labels161.txt labels76.txt
labels10.txt labels163.txt labels78.txt
labels110.txt labels167.txt labels80.txt
labels111.txt labels170.txt labels81.txt
...
Each of these files contains tab-separated data, for example:
Geolife$ cat labels10.txt
Start Time End Time Transportation Mode
2007/06/26 11:32:29 2007/06/26 11:40:29 bus
2008/03/28 14:52:54 2008/03/28 15:59:59 train
2008/03/28 16:00:00 2008/03/28 22:02:00 train
2008/03/29 01:27:50 2008/03/29 15:59:59 train
2008/03/29 16:00:00 2008/03/30 15:59:59 train
2008/03/30 16:00:00 2008/03/31 03:13:11 train
2008/03/31 04:17:59 2008/03/31 15:31:06 train
2008/03/31 16:00:08 2008/03/31 16:09:01 taxi
2008/03/31 17:26:04 2008/04/01 00:35:26 train
2008/04/01 00:48:32 2008/04/01 00:59:23 taxi
...
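I would like to read this data into a pandas DataFrame (the dates in the first column of every file) and add a column that tracks the number of the file each row came from. I am not interested in the time part of the timestamps, only the date, so that I can analyse by year, date, and so on.
The expected output (using the file above as an example) would be: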
User-ID Date Mode
10 2007-06-26 bus
10 2008-03-28 train
10 2008-03-28 train
10 2008-03-29 train
10 2008-03-29 train
10 2008-03-30 train
10 2008-03-31 train
10 2008-03-31 taxi
10 2008-03-31 train
10 2008-04-01 taxi
...
# and contents of all other files, e.g. labels106.txt
106 2007-10-07 car
106 2007-10-08 car
106 2007-10-09 car
....
How can I do this?
EDIT
labels106.txt (like all the other files) contains data in the same format.
Geolife$ cat labels106.txt
Start Time End Time Transportation Mode
2007/10/07 16:00:00 2007/10/08 15:59:59 car
2007/10/08 16:00:00 2007/10/09 15:59:59 car
2007/10/09 16:00:00 2007/10/10 15:59:59 car
This is not exactly what you asked for, but this solution reads the .txt files and writes the data to a .csv file, which you can then read with pandas.read_csv(..).
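import os

files_dir = 'your-geolife-dir'

for root, dirs, files in os.walk(files_dir):
    for file in files:
        if file.endswith('.txt'):
            # 'labels10.txt' -> '10'. Slice the suffix off instead of calling
            # str.strip('.txt'), which strips a set of characters, not a suffix.
            user = file[:-len('.txt')]
            user = user[6:]  # drop the leading 'labels'
            # Mode 'a' appends, so delete data.csv before re-running the script.
            with open(os.path.join(root, file), 'r') as f, open(os.path.join(root,
                    'data.csv'), 'a') as out:  # out - the output csv file
                for line in f:
                    line = line.rstrip()
                    line = line.replace('\t', ',')
                    line = line.replace('/', '-')
                    if not line.startswith('S'):  # skip the 'Start Time ...' header row
                        output = f'{user},{line}'
                        out.write(f'{output}\n')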
Output:
$ cat data.csv
10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus
10,2008-03-28 14:52:54,2008-03-28 15:59:59,train
10,2008-03-28 16:00:00,2008-03-28 22:02:00,train
10,2008-03-29 01:27:50,2008-03-29 15:59:59,train
10,2008-03-29 16:00:00,2008-03-30 15:59:59,train
10,2008-03-30 16:00:00,2008-03-31 03:13:11,train
10,2008-03-31 04:17:59,2008-03-31 15:31:06,train
10,2008-03-31 16:00:08,2008-03-31 16:09:01,taxi
10,2008-03-31 17:26:04,2008-04-01 00:35:26,train
10,2008-04-01 00:48:32,2008-04-01 00:59:23,taxi
106,2007-10-07 16:00:00,2007-10-08 15:59:59,car
106,2007-10-08 16:00:00,2007-10-09 15:59:59,car
106,2007-10-09 16:00:00,2007-10-10 15:59:59,car
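Because data.csv has no header row, loading it might look like the sketch below. The column names and the helper name load_labels_csv are assumptions made for illustration, and the two sample rows are copied from the output above so the snippet runs without the file; pass the path to data.csv instead to load the real data.

```python
import io

import pandas as pd


def load_labels_csv(source):
    """Load the header-less CSV and keep only user, date and mode."""
    df = pd.read_csv(
        source,
        header=None,  # the file has no header row
        names=["User-ID", "Start Time", "End Time", "Mode"],  # assumed names
    )
    start = pd.to_datetime(df["Start Time"], format="%Y-%m-%d %H:%M:%S")
    # normalize() resets the time to midnight but keeps the datetime64 dtype,
    # so .dt.year, .dt.month etc. remain available for later analysis.
    df["Date"] = start.dt.normalize()
    return df[["User-ID", "Date", "Mode"]]


# Two rows copied from the data.csv output, used here in place of the file.
sample = io.StringIO(
    "10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus\n"
    "106,2007-10-07 16:00:00,2007-10-08 15:59:59,car\n"
)
result = load_labels_csv(sample)
print(result)
```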
You can customize the solution as needed (for example, you could add a CSV header row).
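If the intermediate CSV is not needed, the files can probably be read straight into a single DataFrame. The following is a minimal sketch, not a tested solution for the full dataset: it assumes every file is named labels<number>.txt, is tab-separated, and has the header columns "Start Time" and "Transportation Mode" as shown in the question. The function name read_labels is hypothetical.

```python
import re
from pathlib import Path

import pandas as pd


def read_labels(directory):
    """Combine all labels<N>.txt files in `directory` into one DataFrame."""
    frames = []
    for path in sorted(Path(directory).glob("labels*.txt")):
        # Take the user number from the file name, e.g. 'labels10' -> 10.
        match = re.fullmatch(r"labels(\d+)", path.stem)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        raw = pd.read_csv(path, sep="\t")
        start = pd.to_datetime(raw["Start Time"], format="%Y/%m/%d %H:%M:%S")
        frames.append(
            pd.DataFrame(
                {
                    "User-ID": int(match.group(1)),  # scalar, repeated per row
                    "Date": start.dt.normalize(),    # drop the time of day
                    "Mode": raw["Transportation Mode"],
                }
            )
        )
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=["User-ID", "Date", "Mode"])
    return pd.concat(frames, ignore_index=True)
```

Calling read_labels('Geolife') would then return one row per line of every matching file. Keeping "Date" as datetime64 rather than a string means grouping by df["Date"].dt.year or by the date itself works directly.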