Reading snappy parquet files on Windows causes python to crash

I am unable to read snappy parquet files through pyarrow on Windows.

import dask.dataframe as dd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD'))
dd_df = dd.from_pandas(df, npartitions=1)
dd_df.to_parquet("my_df.snappy.parquet", engine="pyarrow", compression="snappy")
dd_df_copy = dd.read_parquet("my_df.snappy.parquet", engine="pyarrow")
dd_df_copy.compute() #<--- This is where it crashes

I have reproduced this issue in a clean Anaconda environment with Python 3.8. After creating the environment, I ran pip install "dask[complete]" and then pip install pyarrow.

The error is:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: python.exe
  Application Version:  3.8.3150.1013
  Application Timestamp:    5ed53446
  Fault Module Name:    arrow.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5ebd3029
  Exception Code:   c000001d
  Exception Offset: 00000000007abfc7
  OS Version:   6.3.9600.2.0.0.16.7
  Locale ID:    1033
  Additional Information 1: d8e4
  Additional Information 2: d8e42c04b828d96accf490cd13472bea
  Additional Information 3: aebe
  Additional Information 4: aebe917bfb5c1b58e884baa1f9c3d3d2
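
For what it's worth, the Exception Code in the report can be decoded: c000001d is the Windows NTSTATUS value STATUS_ILLEGAL_INSTRUCTION. That would be consistent with arrow.dll executing a CPU instruction this machine does not support, although that reading is an inference and not something confirmed here. A minimal sketch of the lookup follows; the helper name and the small table are illustrative only and far from a complete list of codes.

```python
# A few well-known Windows NTSTATUS exception codes (illustrative, not exhaustive).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exception_code(code_hex: str) -> str:
    """Map the hex 'Exception Code' string from a crash report to a symbolic name.

    Hypothetical helper written for this post; it is not part of any library.
    """
    code = int(code_hex, 16)  # accepts "c000001d" as well as "0xC000001D"
    return NTSTATUS_NAMES.get(code, "unknown (0x%08X)" % code)


print(describe_exception_code("c000001d"))
```

The same code appears in both crash reports in this post, so the lookup gives the same answer for the pip and the conda-forge builds.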

A similar version of the crash occurs when I install with conda install -c conda-forge dask pyarrow:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: python.exe
  Application Version:  3.8.3150.1013
  Application Timestamp:    5ed53446
  Fault Module Name:    arrow.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5ecf56ac
  Exception Code:   c000001d
  Exception Offset: 0000000000521587
  OS Version:   6.3.9600.2.0.0.16.7
  Locale ID:    1033
  Additional Information 1: e863
  Additional Information 2: e8638a01b9fb70505b0604ef9b98f3c6
  Additional Information 3: 1e47
  Additional Information 4: 1e47c852f479606e071f3ea8f80878a1

As of July 1, 2020, updating the packages fixes this issue. I believe it was the pyarrow update that resolved it.
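
To see which package versions an environment actually has, before and after updating, something like the sketch below may help. It uses only the standard library's importlib.metadata (available from Python 3.8). The function name is made up for this post, and the package list is just the set of packages used in the example above; it makes no claim about which pyarrow release contains the fix.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}.

    Hypothetical helper for this post; it only reads installed package metadata
    and does not import the packages themselves.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages involved in the reproduction above.
for name, version in installed_versions(["pyarrow", "dask", "pandas", "numpy"]).items():
    print(name, version or "not installed")
```

Because it only reads metadata and never imports pyarrow, this check should be safe to run even in an environment where reading the parquet file crashes the interpreter.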