DataFrame performance warning

Question

I am getting a performance warning from Pandas:

/usr/local/lib/python3.4/dist-packages/pandas/core/generic.py:1471: 
PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->['int', 'str']]

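I have read several GitHub issues and questions here, and they all say this happens because I am mixing types within a column, but I am definitely not. A simple example: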

import pandas as pd
df = pd.DataFrame(columns=['int', 'str'])
df = df.append({ 'int': 0, 'str': '0'}, ignore_index=True)
df = df.append({ 'int': 1, 'str': '1'}, ignore_index=True)
for _, row in df.iterrows():
   print(type(row['int']), type(row['str']))

# <class 'int'> <class 'str'>
# <class 'int'> <class 'str'>

# however
df.dtypes
# int    object
# str    object
# dtype: object

# the following causes the warning
df.to_hdf('table.h5', 'table')

What is this about, and what can I do?

Answer 1

You need to convert the DataFrame's series to numeric types where appropriate.

There are two main ways to do this for integers:

# Method 1
df['col'] = df['col'].astype(int)

# Method 2
df['col'] = pd.to_numeric(df['col'], downcast='integer')

This ensures the data types map directly to C types, so the data can be stored in HDF5 format (the format PyTables uses) without pickling.
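To tie this back to the question, here is a minimal sketch of the conversion applied to a frame shaped like the asker's. It assumes the columns are named 'int' and 'str' as in the question, and it builds the frame with an explicit dtype=object to mimic what appending rows to an empty DataFrame produced there; that construction is an assumption for illustration, not the asker's exact code.

```python
import pandas as pd

# Assumption: dtype=object reproduces the all-object dtypes shown in the
# question, where rows were appended to an initially empty DataFrame.
df = pd.DataFrame({"int": [0, 1], "str": ["0", "1"]}, dtype=object)
print(df.dtypes)  # both columns report dtype object at this point

# Method 1 from above: cast the integer column to a real integer dtype.
df["int"] = df["int"].astype(int)
print(df.dtypes)  # 'int' is now an integer dtype; 'str' is left untouched

# Check the result without relying on a platform-specific dtype name.
print(pd.api.types.is_integer_dtype(df["int"]))
```

After the cast, the only object column left holds plain strings, which should be enough for the call to to_hdf in the question to stop warning about a mixed-integer block.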

Answer 2

Building on @jpp's answer, I traced this problem in my own data to the int64 and float64 dtypes that pandas uses by default when loading a large CSV into a DataFrame.

Workaround:

for c in df.columns[df.dtypes=='float64'].values:
    df[c] = df[c].astype('float')

for c in df.columns[df.dtypes=='int64'].values:
    df[c] = df[c].astype('int')

The frame can now be exported to HDF without the warning. Of course, you could automate this, but for my purposes it was 'good enough'.
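As one possible way to automate the conversion, the sketch below uses DataFrame.infer_objects(), which re-infers a better dtype for object columns. This is a different technique from the loop above, not what this answer used, and the helper name tidy_object_columns is hypothetical. It assumes the problem columns are object columns that actually hold plain Python numbers, as in the question.

```python
import pandas as pd


def tidy_object_columns(df):
    """Return a copy of df whose object columns have re-inferred dtypes.

    Hypothetical helper: infer_objects() only does a soft conversion, so an
    object column of Python ints becomes an integer column, while strings
    such as '0' are not parsed into numbers.
    """
    return df.infer_objects()


# Assumption: an all-object frame like the one in the question.
raw = pd.DataFrame({"int": [0, 1], "str": ["0", "1"]}, dtype=object)
clean = tidy_object_columns(raw)

print(clean.dtypes)
print(pd.api.types.is_integer_dtype(clean["int"]))
```

Unlike pd.to_numeric, this leaves a column of numeric-looking strings alone, which matters here because the 'str' column is meant to stay text.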
