Large file size difference between `pandas to_json` and `read_json`

Question

Setup

The basis of this question is that I'm using celery and rabbitmq to create a distributed HDFStore messaging application that will pass pandas DataFrames to distributed processes (and then write them to an HDFStore). Because json is one of the task serialization protocols that celery accepts, pandas' to_json() and read_json() functions are well suited to the job.

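So my application:

Hits an API and pulls down a pandas.DataFrame
Serializes the DataFrame using to_json()
Passes the serialized value to the celery workers using the celery.group method
Recreates the DataFrame on the other side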
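The serialize/deserialize steps above can be sketched without celery. This is a minimal illustration, assuming Python 3 and a recent pandas; the helper names `frame_to_payload` and `payload_to_frame` are hypothetical and not part of the original application.

```python
import io

import numpy
import pandas


def frame_to_payload(df):
    """Serialize a DataFrame to a JSON string that can travel in a task message."""
    return df.to_json(orient="index", date_format="iso")


def payload_to_frame(payload):
    """Rebuild a DataFrame from the JSON string produced by frame_to_payload."""
    # Wrapping the string in StringIO avoids passing literal JSON directly,
    # which newer pandas releases deprecate.
    return pandas.read_json(io.StringIO(payload), typ="frame", orient="index")


# Small stand-in for the frame pulled down from the API.
index = pandas.date_range(start="2000-01-03", periods=5, freq="B")
df = pandas.DataFrame(
    numpy.random.rand(len(index), 3),
    columns=["a", "b", "c"],
    index=index,
)

payload = frame_to_payload(df)        # what would be handed to the worker
restored = payload_to_frame(payload)  # what the worker would rebuild
```

The values survive the round trip to within the default JSON float precision, so any size difference in the resulting store has to come from something other than the numeric data.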

Problem

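I found that when I created the HDFStores, they were more than 50 times larger when using json than when I simply ran a for loop without serializing / deserializing the objects. So I took celery out of the picture and reproduced the phenomenon with a very simple function: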

import numpy
import pandas
import random


def test_store_size(n_dfs, f_path):
    wj_store = pandas.HDFStore(f_path + 'from_json.h5', mode = 'w')
    nj_store = pandas.HDFStore(f_path + 'from_dfrm.h5', mode = 'w')

    ticks = []

    for i in numpy.arange(n_dfs):

        tag = _rnd_letters(5)
        print "working on " + str(i)

        index = pandas.DatetimeIndex(
                start = '01/01/2000', 
                periods = 1000, 
                freq = 'b'
        )

        df = pandas.DataFrame(
                numpy.random.rand(len(index), 3), 
                columns = ['a', 'b', 'c'], 
                index = index
        )

        nj_store[tag] = df

        stream = df.to_json(
                orient = 'index', 
                date_format = 'iso',
        )

        #stream = df.to_json(orient = 'values')
        wj_df = pandas.read_json(
                stream, 
                typ = 'frame', 
                orient = 'index', 
                dtype = _dtype_cols(df)
        )

        #wj_df = pandas.read_json(stream, convert_dates = False, orient = 'values')
        wj_store[tag] = wj_df

    wj_store.close()
    nj_store.close()

def _rnd_letters(n_letters):
    """Make random tags for the DataFrames"""
    s = 'abcdefghijklmnopqrstuvwxyz'
    return reduce(lambda x, y: x + y, [random.choice(s) for i in numpy.arange(n_letters)])

def _dtype_cols(df):
    """Map every column name to a float dtype"""
    cols = df.columns.tolist()
    return dict([(col, numpy.float) for col in cols])

So if you run the function as follows:

In [1]: test_store_size(n_dfs = 10, f_path = '/Users/benjamingross/Desktop/tq-')

Here is the size difference between the two HDFStores:

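So 21.4 MB is 59 times larger than 365 KB!!! I'm working with thousands of DataFrames, so what should be a small amount of space on my hard drive (400 MB) turns out to be 24 GB, which is now a "Big Data" problem (and shouldn't be).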
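The arithmetic behind those figures can be checked directly. The sketch below assumes decimal units (1 MB = 1000 KB), which is what reproduces the quoted factor of 59; `size_ratio` is a hypothetical helper for comparing two store files on disk, not something from the original code.

```python
import os


def size_ratio(larger_path, smaller_path):
    """Return how many times larger the first file is than the second."""
    return float(os.path.getsize(larger_path)) / os.path.getsize(smaller_path)


# Sizes quoted above, in KB (assuming 1 MB = 1000 KB).
from_json_kb = 21.4 * 1000
from_dfrm_kb = 365.0
ratio = from_json_kb / from_dfrm_kb   # about 58.6, which rounds to 59

# Scaling the expected 400 MB of storage by the rounded factor of 59.
projected_gb = 400 * 59 / 1000.0      # 23.6 GB, quoted above as 24 GB
```

With the two files from the test function, `size_ratio(f_path + 'from_json.h5', f_path + 'from_dfrm.h5')` would give the measured ratio rather than the rounded one.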

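Any help getting the serialization with to_json and read_json to "behave" (i.e. produce the same size before and after serialization) would be greatly appreciated.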

What I've Tried

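I've tried all of the different parameters in to_json / read_json, including orient = values, which almost works, but then I need to serialize the index and columns as well, and the result, once rebuilt, still ends up being 60 times the original size.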

Answer 1

If you look back at your program's output, you will probably see a message like this:

In [7]: wj_df.to_hdf('test.h5', 'key')
PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->axis0] [items->None]

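It's not especially obvious, but your column names are being read back as unicode rather than Python strings. PyTables doesn't handle that well in Python 2, so it falls back to pickling. A relatively easy workaround is to convert the columns to strings, as shown below.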

wj_df.columns = wj_df.columns.astype(str)
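One way to make sure the cast is never forgotten is to apply it at the point of deserialization. The following is a sketch under assumptions: `restore_frame` is a hypothetical helper name, the payload is a hand-written one-row example, and the code targets Python 3, where column labels are already `str` and the cast is harmless rather than necessary.

```python
import io

import pandas


def restore_frame(payload):
    """Rebuild a DataFrame from JSON and force plain-string column labels."""
    frame = pandas.read_json(io.StringIO(payload), orient="index")
    # Same cast as above, applied right after read_json so that every frame
    # reaching the HDFStore carries str column labels.
    frame.columns = frame.columns.astype(str)
    return frame


# Hand-written payload in the orient='index' layout used in the question.
payload = '{"2000-01-03T00:00:00.000":{"a":0.1,"b":0.2,"c":0.3}}'
frame = restore_frame(payload)

# The warning reports an inferred_type for the axis labels; this attribute
# shows what pandas infers for the columns after the cast.
print(frame.columns.inferred_type)
```

In the question's loop this would replace the bare `pandas.read_json(...)` call, so that `wj_store[tag] = restore_frame(stream)` stores frames whose column axis PyTables can map directly.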

There is an issue for this on GitHub:

https://github.com/pydata/pandas/issues/5743
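Since the warning is easy to overlook in a long run, one option is to escalate it to an exception while writing. This is only a sketch: `write_strictly` is a hypothetical wrapper, `pandas.errors.PerformanceWarning` is the location of the warning class in current pandas (older releases exposed it elsewhere), and `_fake_write` is a stand-in that emits the same warning class so the example runs without PyTables.

```python
import warnings

import pandas


def write_strictly(write_fn):
    """Run write_fn, raising if pandas emits a PerformanceWarning."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=pandas.errors.PerformanceWarning)
        return write_fn()


def _fake_write():
    # Stand-in for a call such as wj_df.to_hdf('test.h5', 'key').
    warnings.warn("pickling object types", pandas.errors.PerformanceWarning)


try:
    write_strictly(_fake_write)
    raised = False
except pandas.errors.PerformanceWarning:
    raised = True
```

In real use the stand-in would be replaced by the actual write, for example `write_strictly(lambda: wj_df.to_hdf('test.h5', 'key'))`, so a frame that would be pickled fails loudly instead of silently inflating the store.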

Tags: python, serialization, json, celery, pandas
