Reading snappy parquet files on Windows causes python to crash
I am unable to read snappy-compressed parquet files through pyarrow on Windows.
import dask.dataframe as dd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD'))
dd_df = dd.from_pandas(df, npartitions=1)
dd_df.to_parquet("my_df.snappy.parquet", engine="pyarrow", compression="snappy")
dd_df_copy = dd.read_parquet("my_df.snappy.parquet", engine="pyarrow")
dd_df_copy.compute() #<--- This is where it crashes
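Because the failure is a hard crash of python.exe rather than a Python exception, a try/except around compute() cannot catch it. A minimal sketch of one way to observe it safely is to run the read in a child interpreter and inspect its exit code. The helper name run_isolated and the READ_SNIPPET string are hypothetical and only for illustration.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 120.0) -> int:
    """Run `code` in a fresh Python interpreter and return its exit code.

    A native crash in the child (for example inside arrow.dll) ends only
    the child process; the parent just sees a non-zero exit code.
    """
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return result.returncode


# Hypothetical snippet mirroring the read step from the reproduction above.
READ_SNIPPET = (
    "import dask.dataframe as dd\n"
    "dd.read_parquet('my_df.snappy.parquet', engine='pyarrow').compute()\n"
)

# Usage (needs dask, pyarrow and the parquet file written earlier):
#   exit_code = run_isolated(READ_SNIPPET)
#   print(f"child exit code: 0x{exit_code & 0xFFFFFFFF:08X}")
```

On Windows I would expect the exit code of a crashed child to carry the exception code from the crash report, but I have not verified that for this particular crash.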
I have reproduced this problem in a clean Anaconda environment with Python 3.8. After creating the environment, I ran pip install "dask[complete]" and pip install pyarrow.
The error is:
Problem signature:
Problem Event Name: APPCRASH
Application Name: python.exe
Application Version: 3.8.3150.1013
Application Timestamp: 5ed53446
Fault Module Name: arrow.dll
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 5ebd3029
Exception Code: c000001d
Exception Offset: 00000000007abfc7
OS Version: 6.3.9600.2.0.0.16.7
Locale ID: 1033
Additional Information 1: d8e4
Additional Information 2: d8e42c04b828d96accf490cd13472bea
Additional Information 3: aebe
Additional Information 4: aebe917bfb5c1b58e884baa1f9c3d3d2
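For reading the report: the Exception Code field is a Windows NTSTATUS value, and c000001d is STATUS_ILLEGAL_INSTRUCTION. A small sketch that turns the hex string from the report into a name follows. The table covers only a few common codes, and the function name describe_exception_code is my own. An illegal-instruction fault often means the DLL executed a CPU instruction this processor does not support, but that is an assumption about the cause, not something the report itself proves.

```python
# A few well-known NTSTATUS codes; this table is deliberately incomplete.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exception_code(code_hex: str) -> str:
    """Map the hex 'Exception Code' from a crash report to a readable name."""
    value = int(code_hex, 16)
    return NTSTATUS_NAMES.get(value, f"unknown (0x{value:08X})")


print(describe_exception_code("c000001d"))
```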
A similar crash occurs when I try using conda -c conda-forge dask pyarrow:
Problem signature:
Problem Event Name: APPCRASH
Application Name: python.exe
Application Version: 3.8.3150.1013
Application Timestamp: 5ed53446
Fault Module Name: arrow.dll
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 5ecf56ac
Exception Code: c000001d
Exception Offset: 0000000000521587
OS Version: 6.3.9600.2.0.0.16.7
Locale ID: 1033
Additional Information 1: e863
Additional Information 2: e8638a01b9fb70505b0604ef9b98f3c6
Additional Information 3: 1e47
Additional Information 4: 1e47c852f479606e071f3ea8f80878a1
As of July 1, 2020, updated packages fix this issue. I believe it was the pyarrow update that resolved it.
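Since I can only guess which package update made the difference, recording the installed versions before and after upgrading helps narrow it down. A minimal sketch using the standard library's importlib.metadata (available from Python 3.8) follows. The helper name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages involved in the reproduction above.
for package, version in installed_versions(["pyarrow", "dask", "pandas", "numpy"]).items():
    print(f"{package}: {version}")
```

Comparing this output between the crashing environment and a working one shows which distribution actually changed.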