Exporting a dataframe with nullable Int64 from pandas to R
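I am trying to export a dataframe that contains categorical and nullable integer columns so that R can read it easily.
I put my bets on Apache Feather, but unfortunately the Int64 dtype from pandas does not seem to be implemented: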
from pyarrow import feather
import pandas as pd
col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
df = pd.DataFrame({'a': col1, 'b': col2})
feather.write_feather(df, '/tmp/foo')
This is the error message I get:
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
181 writer = FeatherWriter(dest)
182 try:
--> 183 writer.write(df)
184 except Exception:
185 # Try to make sure the resource is closed
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
92 # TODO(wesm): Remove this length check, see ARROW-1732
93 if len(df.columns) > 0:
---> 94 table = Table.from_pandas(df, preserve_index=False)
95 for i, name in enumerate(table.schema.names):
96 col = table[i]
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
542 e.args += ("Conversion failed for column {0!s} with type {1!s}"
543 .format(col.name, col.dtype),)
--> 544 raise e
545 if not field_nullable and result.null_count > 0:
546 raise ValueError("Field {} was non-nullable but pandas column "
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
536
537 try:
--> 538 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
539 except (pa.ArrowInvalid,
540 pa.ArrowNotImplementedError,
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')
Is there a workaround that lets me use this special Int64 dtype, preferably with pyarrow?
With the latest Arrow release (pyarrow 0.15.0), and when using the pandas development version, this is now supported:
In [1]: from pyarrow import feather
...: import pandas as pd
...:
...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
...:
...: df = pd.DataFrame({'a': col1, 'b': col2})
...:
...: feather.write_feather(df, '/tmp/foo')
In [2]: feather.read_table('/tmp/foo')
Out[2]:
pyarrow.Table
a: int64
b: int64
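You can see that the resulting Arrow table (when read back) correctly contains integer columns.
So to get this in a released version, you will have to wait for pandas 1.0.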
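For now (without using pandas master), you have two workarounds:
Convert the columns to object dtype (df['a'] = df['a'].astype(object)) and then write to Feather. For those object columns (holding integers and missing values), pyarrow will correctly infer that they are integers.
Monkeypatch pandas for now (until the next pandas release):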
import pyarrow  # the lambda below refers to the top-level pyarrow module
pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)
With that, writing nullable integer columns with pyarrow / Feather should work out of the box (you still need the latest pyarrow 0.15.0 for this).
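The first workaround (casting to object dtype) can be applied to every nullable integer column at once. The sketch below is only an illustration: the helper name nullable_ints_to_object is hypothetical, and it assumes a pandas version that exposes pd.api.extensions.ExtensionDtype. It only prepares the frame; writing it with feather.write_feather is left as in the question.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def nullable_ints_to_object(df):
    """Return a copy of df with nullable-integer columns cast to object dtype.

    Hypothetical helper, not part of pandas or pyarrow.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        # An extension dtype that is also an integer dtype covers Int8 ... UInt64.
        if isinstance(dtype, pd.api.extensions.ExtensionDtype) and is_integer_dtype(dtype):
            out[name] = out[name].astype(object)
    return out


df = pd.DataFrame({
    "a": pd.Series([0, None, 1, 23]).astype("Int64"),
    "b": pd.Series([1, 3, 2, 1]).astype("Int64"),
})
converted = nullable_ints_to_object(df)
print(converted.dtypes)
```

The original frame is left untouched, so the Int64 columns remain available for further work in pandas.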
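Note that reading the Feather file back into a pandas DataFrame will for now still produce a float column (if there are missing values), because that is the default conversion from Arrow integers to pandas. Preserving those specific pandas types when converting to pandas still needs to be done as well (see https://issues.apache.org/jira/browse/ARROW-2428).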
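If the float column is a problem on the pandas side, one option is to cast it back after reading. The sketch below does not read a real file: read_back is a made-up stand-in that mimics the float result described above, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Stand-in for a frame read back from Feather: column "a" had a missing
# value and therefore arrives as float64 (assumption, no file is read here).
read_back = pd.DataFrame({"a": [0.0, np.nan, 1.0, 23.0], "b": [1, 3, 2, 1]})

# Cast back to the nullable integer dtype; NaN becomes a missing value.
restored = read_back.astype({"a": "Int64", "b": "Int64"})
print(restored.dtypes)
```

This only works when every non-missing float is a whole number, which holds for values that started out as integers.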