How do I read a tab-separated CSV in blaze?
I have a "CSV" data file in the following format (well, it is really more of a TSV):
event pdg x y z t px py pz ekin
3383 11 -161.515 5.01938e-05 -0.000187112 0.195413 0.664065 0.126078 -0.736968 0.00723234
1694 11 -161.515 -0.000355633 0.000263174 0.195413 0.511853 -0.523429 0.681196 0.00472714
4228 11 -161.535 6.59631e-06 -3.32796e-05 0.194947 -0.713983 -0.0265468 -0.69966 0.0108681
4233 11 -161.515 -0.000524488 6.5069e-05 0.195413 0.942642 0.331324 0.0406377 0.017594
pandas can read this file as-is:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze (which claims to accept the pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
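None of these works, and pandas is not used at all. The "sniffer" that tries to infer column names and types simply calls csv.Sniffer.sniff() from the standard library, which fails.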
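The sniffing step can be tried outside blaze with only the standard library, which helps narrow down where things go wrong. The following is a minimal sketch, assuming the sample rows contain real tab characters (the file on disk may differ); the helper name sniff_delimiter is made up for illustration and is not part of blaze or odo.

```python
import csv

# A few rows written with real tab characters (an assumption about the file).
sample = (
    "event\tpdg\tx\ty\n"
    "3383\t11\t-161.515\t5.01938e-05\n"
    "1694\t11\t-161.515\t-0.000355633\n"
    "4228\t11\t-161.535\t6.59631e-06\n"
)


def sniff_delimiter(text, delimiters=None):
    """Return the delimiter csv.Sniffer settles on, or None if it gives up."""
    try:
        return csv.Sniffer().sniff(text, delimiters=delimiters).delimiter
    except csv.Error:
        # csv.Error is what surfaces as "Could not determine delimiter".
        return None


# Restrict the candidates to a tab, since that is what the file should use.
print(repr(sniff_delimiter(sample, delimiters="\t")))
```

Feeding the helper the first few kilobytes of the real file instead of the inline sample shows whether the sniffer alone fails on it; a return value of None means csv.Sniffer raised csv.Error.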
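Is there a way to read this file correctly in blaze? Its "little brother" is a few hundred MB, so I would like to use blaze's sequential processing capabilities.
Thanks for any ideas.
Edit: I think this might be a problem in odo/csv, so I filed an issue: https://github.com/blaze/odo/issues/327
Edit 2:
Full error: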
Error Traceback (most recent call last) in () ----> 1 bz.Data("test.csv", sep="\t", index_col=False)
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
54 if isinstance(data, _strtypes):
55 data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56 **kwargs)
57 if (isinstance(data, Iterator) and
58 not isinstance(data, tuple(not_an_iterator))):
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
62
63 def __call__(self, s, *args, **kwargs):
---> 64 return self.dispatch(s)(s, *args, **kwargs)
65
66 @property
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
277 def resource_csv(uri, **kwargs):
--> 278 return CSV(uri, **kwargs)
279
280
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
102 if has_header is None:
103 self.has_header = (not os.path.exists(path) or
--> 104 infer_header(path, sniff_nbytes))
105 else:
106 self.has_header = has_header
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
58 with open_file(path, 'rb') as f:
59 raw = f.read(nbytes)
---> 60 return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
61
62
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
392 # subtracting from the likelihood of the first row being a header.
393
--> 394 rdr = reader(StringIO(sample), self.sniff(sample))
395
396 header = next(rdr) # assume first row is header
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
187
188 if not delimiter:
--> 189 raise Error("Could not determine delimiter")
190
191 class dialect(Dialect):
Error: Could not determine delimiter
I am using Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.
conda install dask
conda install blaze
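You can import the data for use with blaze as follows: parse it with pandas first, then hand the result to blaze. That may defeat the purpose, but it works without trouble.
As a side note, to parse the data file correctly, the line in your pandas parsing statement should be: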
from blaze import Data
from pandas import DataFrame, read_csv
data = read_csv("csvdata.dat", sep=r"\s+", index_col=False)  # split on runs of whitespace
bdata = Data(data)
The data is now parsed correctly, without errors. bdata:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
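If loading the whole file into memory is the concern, pandas itself can read it in pieces through the chunksize argument of read_csv. The following is a minimal sketch, assuming a tab-separated file with an ekin column; the in-memory sample, the helper name total_ekin and the chunk size are illustrative only.

```python
import io

import pandas as pd

# Stand-in for the large file; with real data, pass the file path instead.
sample = io.StringIO(
    "event\tpdg\tx\tekin\n"
    "3383\t11\t-161.515\t0.00723234\n"
    "1694\t11\t-161.515\t0.00472714\n"
    "4228\t11\t-161.535\t0.0108681\n"
    "4233\t11\t-161.515\t0.017594\n"
)


def total_ekin(source, chunksize=2):
    """Count rows and sum the ekin column one chunk at a time."""
    rows = 0
    total = 0.0
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, sep="\t", index_col=False, chunksize=chunksize):
        rows += len(chunk)
        total += chunk["ekin"].sum()
    return rows, total


rows, ekin_sum = total_ekin(sample)
print(rows, ekin_sum)
```

Only one chunk is held in memory at a time, so a realistic chunksize (tens of thousands of rows, say) keeps the footprint small for a file of a few hundred MB.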
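Here is an alternative using dask, which can probably do the same chunked or large-scale processing you are looking for. Dask loads the TSV format correctly right out of the box.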
In [17]: import dask.dataframe as dd
In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
In [19]: df.head()
Out[19]:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
4 854 11 -161.515 0.000032 0.000418 0.195414 0.675752 0.315671
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
4 -0.666116 0.012641
See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask