Pandas select columns and data dependent on header
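I have a large .csv file. I want to select only the column containing the time/date and 20 other columns that I know by header.
As a test, I tried taking only the column with the header 'TIMESTAMP'. I know this column is 4207823 rows long in the .csv and contains only dates and times. The code below selects the TIMESTAMP column, but it then also goes on to take values from other columns, as shown below: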
import pandas

# Raw string so the backslashes in the Windows path are not read as escape sequences
f = pandas.read_csv(r'C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv', dtype=object, low_memory=False)  # load the whole file, every column as strings
time = f[['TIMESTAMP']]  # select only the TIMESTAMP column
time = time[0:4207823]  # test to see if this stops time taking other data
print(time)
Output:
TIMESTAMP
0 2007-08-15 21:10:00
1 2007-08-15 21:20:00
2 2007-08-15 21:30:00
3 2007-08-15 21:40:00
4 2007-08-15 21:50:00
5 2007-08-15 22:00:00
6 2007-08-15 22:10:00
7 2007-08-15 22:20:00
8 2007-08-15 22:30:00
9 2007-08-15 22:40:00
10 2007-08-15 22:50:00
11 2007-08-15 23:00:00
12 2007-08-15 23:10:00
13 2007-08-15 23:20:00
14 2007-08-15 23:30:00
15 2007-08-15 23:40:00
16 2007-08-15 23:50:00
17 2007-08-16 00:00:00
18 2007-08-16 00:10:00
19 2007-08-16 00:20:00
20 2007-08-16 00:30:00
21 2007-08-16 00:40:00
22 2007-08-16 00:50:00
23 2007-08-16 01:00:00
24 2007-08-16 01:10:00
25 2007-08-16 01:20:00
26 2007-08-16 01:30:00
27 2007-08-16 01:40:00
28 2007-08-16 01:50:00
29 2007-08-16 02:00:00 #these are from the TIMESTAMP column
... ...
679302 221.484 #This is from another column
679303 NaN
679304 2015-09-23 06:40:00
679305 NaN
679306 NaN
679307 2015-09-23 06:50:00
679308 NaN
679309 NaN
679310 2015-09-23 07:00:00
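The problem was caused by an error in the input file, so simply using usecols in pandas.read_csv did the trick.
The code below demonstrates selecting a few columns of data: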
import pandas

# read only the selected columns
df = pandas.read_csv('DataB - Copy - Copy.csv', delimiter=',', dtype=object, low_memory=False,
                     usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'])
print(df)  # see what the data looks like
df.to_csv('DataB_GreaterGabbardOnly.csv')  # save the selection to a new .csv file
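For anyone hitting the same symptom, here is a minimal sketch of how the malformed rows might be located before trusting the selection. It is not what I actually ran: the in-memory sample data and the extra column named other are made up for illustration, and it assumes every valid TIMESTAMP looks like 2007-08-15 21:10:00. It uses usecols as above, then pandas.to_datetime with errors="coerce" to flag values that are not dates.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: the third data row is malformed,
# so a numeric value sits where a timestamp should be.
raw = io.StringIO(
    "TIMESTAMP,igmmx_U_77m,igmmx_U_58m,other\n"
    "2007-08-15 21:10:00,7.1,6.8,a\n"
    "2007-08-15 21:20:00,7.3,6.9,b\n"
    "221.484,,,\n"
    "2007-08-15 21:30:00,7.0,6.5,c\n"
)

# Read only the columns of interest, all as strings.
df = pd.read_csv(raw, dtype=object,
                 usecols=["TIMESTAMP", "igmmx_U_77m", "igmmx_U_58m"])

# Values that do not match the assumed format become NaT instead of raising.
parsed = pd.to_datetime(df["TIMESTAMP"], format="%Y-%m-%d %H:%M:%S",
                        errors="coerce")

bad_rows = df[parsed.isna()]   # rows whose TIMESTAMP is not a date
clean = df[parsed.notna()]     # rows that look well formed

print(bad_rows)
print(len(clean))
```

Inspecting bad_rows shows which lines of the input need fixing; clean can be written out with to_csv in the same way as above.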