Dask ParserError: Error tokenizing data when reading CSV

I'm getting the same error as this question, but the recommended solution of setting blocksize=None isn't fixing my issue. I'm trying to convert the NYC taxi data from CSV to Parquet, and this is the code I'm running:

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)

ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

This is the error I'm getting:

"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".

Adding blocksize=None sometimes helps, see here for example, and I'm not sure why it isn't resolving my issue.

Any suggestions on how to get around this problem?

This code works for the 2011 taxi data, so there must be something odd in the 2010 taxi data that's causing this issue.

The original file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (too many commas). This is the offending line (in the middle) and its neighbours:

VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001

Some options are:

  • the on_bad_lines kwarg to pandas can be set to warn or skip (so it should work with dask.dataframe as well); see the first sketch after this list;

  • fix the original file on the fly, either with something like sed (assuming you can modify the original file) or by reading the file line by line (knowing where the error is); see the second sketch after this list.
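
For the first option, here is a minimal sketch, assuming pandas >= 1.3 (where the on_bad_lines argument is available); dd.read_csv passes extra keyword arguments through to pandas.read_csv, so the original call only needs one extra argument:

import dask.dataframe as dd

# Sketch: forward on_bad_lines to pandas so the malformed row is skipped
# (or logged with "warn"). Note that "skip" silently drops that trip record.
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
    on_bad_lines="skip",  # or "warn" to see which lines were dropped
)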
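
For the second option, here is a rough sketch of the line-by-line repair, assuming the file has been downloaded locally; the input/output paths are hypothetical, and the simple field-count check assumes the data has no quoted fields containing commas:

# Keep only rows with the expected number of fields (18 fields = 17 commas,
# per the ParserError above) and report anything that gets dropped.
EXPECTED_FIELDS = 18

with open("yellow_tripdata_2010-02.csv") as src, \
        open("yellow_tripdata_2010-02.fixed.csv", "w") as dst:
    for lineno, line in enumerate(src, start=1):
        if line.count(",") == EXPECTED_FIELDS - 1:
            dst.write(line)
        else:
            print(f"dropping malformed line {lineno}: {line.rstrip()}")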