Dask ParserError: Error tokenizing data when reading CSV
I'm getting the same error as this question, but the recommended solution of setting blocksize=None isn't resolving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet, and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)
ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Adding blocksize=None sometimes helps, see here for example, and I'm not sure why it isn't fixing my problem.
Any suggestions on how to resolve this?
This code works fine with the 2011 taxi data, so there must be something odd in the 2010 taxi data that's causing this issue.
The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (too many commas). Here's the offending line (middle) and its neighbours:
VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001
Some options are:
- the on_bad_lines kwarg to pandas, which can be set to warn or skip (and should therefore work with dask.dataframe as well); a sketch follows this list;
- fixing the raw file with something like sed (assuming you can modify the original files), or on the fly by reading the file line by line, knowing where the error is; see the second sketch below.
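A minimal sketch of the first option, reusing the read_csv call from the question. This assumes pandas >= 1.3, where on_bad_lines was introduced; dask.dataframe.read_csv forwards extra keyword arguments to pandas.read_csv:

import dask.dataframe as dd

# Same call as in the question, plus on_bad_lines="skip" to drop
# malformed rows instead of raising ParserError ("warn" skips them
# too, but emits a warning for each bad line).
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
    on_bad_lines="skip",
)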
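And a sketch of the "fix on the fly" option. The helper below is hypothetical (fix_csv and expected_fields are names I've made up), and the naive comma count assumes fields are never quoted; the error message tells us a good row here has 18 fields:

# Hypothetical helper: stream a CSV line by line and copy only rows
# with the expected field count to a new file.
# NOTE: counting commas assumes no quoted fields containing commas.
def fix_csv(src_path, dst_path, expected_fields=18):
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # A row with N fields has N - 1 separating commas.
            if line.count(",") == expected_fields - 1:
                dst.write(line)

fix_csv(
    "yellow_tripdata_2010-02.csv",        # local copy of the raw file
    "yellow_tripdata_2010-02_fixed.csv",  # cleaned output
)

Dropping the row loses one trip record; if that matters, you could instead repair the known bad line (it has a doubled comma) rather than discard it.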