Pyarrow from_pandas crashes the interpreter when a dataframe contains mixed dtypes

With pyarrow 0.6.0 (or lower), the following snippet crashes the Python interpreter:

import pandas as pd
import pyarrow as pa

data = pd.DataFrame({'a': [1, True]})
pa.Table.from_pandas(data)

"The Python interpreter has stopped working" (under Windows)

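After some investigation, it turns out the problem was fixed in pyarrow 0.7.0 (see the Jira issue and, more precisely, this commit). With the same snippet as in the question, instead of crashing the interpreter we now get the following error: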

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "table.pxi", line 755, in pyarrow.lib.Table.from_pandas
File "C:\Temp\tt\Tools\Anaconda3.4.3.1\envs\GMF_test3\lib\site-packages\pyarrow\pandas_compat.py", line 227, in dataframe_to_arrays
    col, type=type, timestamps_to_ms=timestamps_to_ms
File "array.pxi", line 225, in pyarrow.lib.Array.from_pandas
File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer

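One way to work around this, when you have control over the data, is to convert the column with mixed dtypes when the exception occurs, as follows (and probably log the exception, since this is not a common error):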

import pandas as pd
import pyarrow as pa
import logging

logger = logging.getLogger(__name__)

data = pd.DataFrame({'a': [1, True], 'b': [1, 2]})


def convert_type_if_needed(type_to_select, df, col_name):
    # Collect the Python type of every value in the column.
    types = {type(value) for value in df[col_name]}
    # Cast the column only if the target type actually occurs in it.
    if type_to_select in types:
        return df.astype({col_name: type_to_select})
    raise TypeError(str(type_to_select) + " is not in the dataframe, conversion impossible")


try:
    table = pa.Table.from_pandas(data)
except pa.lib.ArrowInvalid as e:
    logger.warning(e)
    data = convert_type_if_needed(int, data, 'a')
    table = pa.Table.from_pandas(data)

print(table)

Which finally produces:

pyarrow.Table
Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer
a: int32
b: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"name": "a", "numpy_type": "int32", "pandas_type":'
            b' "int32", "metadata": null}, {"name": "b", "numpy_type": "int64"'
            b', "pandas_type": "int64", "metadata": null}, {"name": "__index_l'
            b'evel_0__", "numpy_type": "int64", "pandas_type": "int64", "metad'
            b'ata": null}], "index_columns": ["__index_level_0__"], "pandas_ve'
            b'rsion": "0.20.3"}'}
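
If you would rather not wait for the exception, the mixed columns can also be found up front with pandas alone. The following is only a sketch of that idea, not something pyarrow provides: the helper name find_mixed_type_columns is hypothetical, and casting to 'int64' when a column holds only int and bool values is an assumption about what you want (it turns True into 1, as in the output above).

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return {column name: set of Python types} for object columns with more than one type.

    Hypothetical helper, written for illustration only.
    """
    mixed = {}
    for col_name, column in df.items():
        # Only object-dtype columns can hold heterogeneous Python values.
        if column.dtype != object:
            continue
        # Ignore missing values so that NaN/None does not count as an extra type.
        types = {type(value) for value in column.dropna()}
        if len(types) > 1:
            mixed[col_name] = types
    return mixed


data = pd.DataFrame({'a': [1, True], 'b': [1, 2]})

mixed_columns = find_mixed_type_columns(data)
print(mixed_columns)

# Assumption: a column made only of int and bool values is meant to be integer.
for col_name, types in mixed_columns.items():
    if types <= {int, bool}:
        data = data.astype({col_name: 'int64'})

print(data.dtypes)
```

After this step both columns have an integer dtype, so pa.Table.from_pandas(data) should no longer hit the bool/integer conversion error shown above. Columns mixing other types (for example str and int) are reported by the helper but left untouched, since the right target type depends on your data.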