Faster serializations (pickle, parquet, feather, ...) than JSON in plotly Dash Store?

Question

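Context

In a dashboard built with plotly Dash, I need to run an expensive download from a database only when one component changes (a DatePickerRange that selects the period to download). The resulting DataFrame should then be reused by other components (e.g. Dropdowns that filter the DataFrame) without repeating the expensive download.

The docs suggest using dash_core_components.Store as the Output of a callback that returns the DataFrame serialized to JSON, and then using the Store as an Input of the other callbacks, which deserialize the JSON back into a DataFrame.

Serializing to and from JSON is slow: each time I update a component, the plot takes 30 seconds to update.

I tried faster serializations such as pickle, parquet and feather, but in the deserialization step I get an error saying the object is empty (no such error appears with JSON).

Question

Is it possible to serialize data in a Dash Store with a faster method than JSON, such as pickle, feather or parquet (they take roughly half the time for my dataset)? If so, how?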

Code

import io
import logging
import traceback
import pandas as pd
from datetime import datetime, date, timedelta

import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
from plotly.subplots import make_subplots


# options, default_options and expensive_load_from_db are defined elsewhere (not shown)
logger = logging.getLogger(__name__)
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])
today = date.today()

app.layout = html.Div([
    dbc.Row(dbc.Col(html.H1('PMC'))),
    dbc.Row(dbc.Col(html.H5('analysis'))),
    html.Hr(),
    html.Br(),

    dbc.Container([
        dbc.Row([
            dbc.Col(
                dcc.DatePickerRange(
                    id='date_ranges',
                    start_date=today - timedelta(days=20),
                    end_date=today,
                    max_date_allowed=today, display_format='MMM Do, YY',
                ),
                width=4
            ),
        ]),
        dbc.Row(
            dbc.Col(
                dcc.Dropdown(
                    id='dd_ycolnames',
                    options=options,
                    value=default_options,
                    multi=True,
                ),
            ),
        ),
    ]),

    dbc.Row([
        dbc.Col(
            dcc.Graph(
                id='graph_subplots',
                figure={},
            ),
            width=12
        ),
    ]),

    dcc.Store(id='store')
])


@app.callback(
    Output('store', 'data'),
    [
        Input(component_id='date_ranges', component_property='start_date'),
        Input(component_id='date_ranges', component_property='end_date')
    ]
)
def load_dataset(date_ranges_start, date_ranges_end):
    # some expensive clean data step
    logger.info('loading dataset...')
    date_ranges1_start = datetime.strptime(date_ranges_start, '%Y-%m-%d')
    date_ranges1_end = datetime.strptime(date_ranges_end, '%Y-%m-%d')
    df = expensive_load_from_db(date_ranges1_start, date_ranges1_end)
    logger.info('dataset to json...')
    #return df.to_json(date_format='iso', orient='split')
    return df.to_parquet()                                 # <----------------------


@app.callback(
    Output(component_id='graph_subplots', component_property='figure'),
    [
        Input(component_id='store', component_property='data'),
        Input(component_id='dd_ycolnames', component_property='value'),
    ],
)
def update_plot(df_bin, y_colnames):
    logger.info('dataset from json')
    #df = pd.read_json(df_bin, orient='split')
    df = pd.read_parquet(io.BytesIO(df_bin))             # <----------------------
    logger.info('building plot...')
    traces = []
    for y_colname in y_colnames:
        if df[y_colname].dtype == 'bool':
            df[y_colname] = df[y_colname].astype('int')
        traces.append(
            {'x': df['date'], 'y': df[y_colname].values, 'name': y_colname},
        )
    fig = make_subplots(
        rows=len(y_colnames), cols=1, shared_xaxes=True, vertical_spacing=0.1
    )
    fig.layout.height = 1000
    for i, trace in enumerate(traces):
        fig.append_trace(trace, i+1, 1)
    logger.info('plotted')
    return fig


if __name__ == '__main__':
    app.run_server(host='localhost', debug=True)

Error text

OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet file size is 0 bytes

Answer 1

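Because the data is exchanged between client and server, you are currently limited to JSON serialization. One way around this limitation is the ServersideOutput component from dash-extensions, which stores the data on the server. By default it uses file storage and pickle serialization, but you can also use other storage backends (e.g. Redis) and/or serialization protocols (e.g. Arrow). Here is a small example: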

import time
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px
from dash_extensions.enrich import Dash, Output, Input, State, ServersideOutput

app = Dash(prevent_initial_callbacks=True)
app.layout = html.Div([
    html.Button("Query data", id="btn"), dcc.Dropdown(id="dd"), dcc.Graph(id="graph"),
    dcc.Loading(dcc.Store(id='store'), fullscreen=True, type="dot")
])


@app.callback(ServersideOutput("store", "data"), Input("btn", "n_clicks"))
def query_data(n_clicks):
    time.sleep(1)
    return px.data.gapminder()  # no JSON serialization here


@app.callback(Output("dd", "options"), Input("store", "data"))
def update_dd(df):
    # One dropdown option per distinct year; no JSON de-serialization here
    years = sorted(df["year"].unique().tolist())
    return [{"label": year, "value": year} for year in years]


@app.callback(Output("graph", "figure"), [Input("dd", "value"), State("store", "data")])
def update_graph(value, df):
    df = df.query("year == {}".format(value))  # no JSON de-serialization here
    return px.sunburst(df, path=['continent', 'country'], values='pop', color='lifeExp', hover_data=['iso_alpha'])


if __name__ == '__main__':
    app.run_server()
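
To make the idea concrete, here is a minimal, standard-library-only sketch of the principle behind server-side storage. It is not dash-extensions' actual implementation. It first checks that raw bytes (such as the output of `df.to_parquet()` or `pickle.dumps`) cannot be JSON-serialized, then keeps the object in a pickle file on the server and hands out only a short string key, which is the kind of value a Store can carry. The names `CACHE_DIR`, `is_json_serializable`, `put_serverside` and `get_serverside` are hypothetical, and a plain dict stands in for the DataFrame.

```python
import json
import pickle
import tempfile
import uuid
from pathlib import Path

# Hypothetical cache directory for this sketch; dash-extensions manages its own storage.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="dash_cache_"))


def is_json_serializable(value):
    """Return True if `value` survives json.dumps, as Store data must."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


def put_serverside(obj):
    """Pickle `obj` to a file on the server and return a short string key."""
    key = uuid.uuid4().hex
    with open(CACHE_DIR / f"{key}.pkl", "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return key


def get_serverside(key):
    """Load the object previously stored under `key`."""
    with open(CACHE_DIR / f"{key}.pkl", "rb") as fh:
        return pickle.load(fh)


# Stand-in for a DataFrame; any picklable object is handled the same way.
table = {"date": ["2021-01-01", "2021-01-02"], "value": [1.5, 2.5]}
payload = pickle.dumps(table)  # bytes, comparable to what df.to_parquet() returns

print(is_json_serializable(payload))  # False: raw bytes cannot be JSON-encoded
key = put_serverside(table)
print(is_json_serializable(key))      # True: the key is a plain string
print(get_serverside(key) == table)   # True: the object round-trips via the server
```

In a Dash app built this way, the loading callback would return the key and downstream callbacks would call `get_serverside(key)`. Only unpickle files that your own server wrote, since loading a pickle can execute arbitrary code. Base64-encoding the bytes into a string would also get them through the Store, but it enlarges the payload by about a third and still ships everything to the browser.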


python

pickle

parquet

feather

plotly-dash
