yaml dump of a pandas dataframe

Question

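I thought I'd share this, since I searched for it on SO and couldn't find what I needed.

I want to dump a pd.DataFrame to a yaml file.

Timestamps should be displayed nicely, not as the default: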

  date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
  - 1589241600000000000
  - null
  - null

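Also, the output should be valid YAML, i.e. it should be readable back with yaml.load. The output should be reasonably concise, i.e. prefer the 'flow' format.

For example, here is some data: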

import pandas as pd

df = pd.DataFrame([
    dict(
        date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
        x=0,
        b='foo',
        c=[1,2,3,4],
        other_t=pd.Timestamp.now(),
    ),
    dict(
        date=pd.Timestamp.now().normalize(),
        x=1,
        b='bar',
        c=list(range(32)),
        other_t=pd.Timestamp.now(),
    ),
]).set_index('date')

Answer 1

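Here is what I came up with. It customizes the Dumper a little to handle Timestamp. The output is cleaner and is still valid yaml. On loading, yaml recognizes the format of valid datetimes (ISO format, I think) and re-creates them as datetime. In fact, we can read the text back into a DataFrame, where those datetime values are automatically converted to Timestamp. After setting the index again, we observe that the new df is identical to the original.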
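To illustrate the loading side, here is a minimal sketch of how PyYAML's safe loader resolves plain scalars that look like timestamps. The keys and values are made up for the example (they mirror the output format of this answer), and no pandas is involved yet.

```python
import datetime

import yaml

# Plain (unquoted) scalars in the "YYYY-MM-DD HH:MM:SS[.ffffff]" form are
# resolved as YAML timestamps by the loader.
sample = """
date: 2020-05-12 00:00:00
other_t: 2020-05-13 02:30:23.422589
day_only: 2020-05-12
"""

loaded = yaml.safe_load(sample)

# Date-and-time scalars come back as datetime.datetime ...
print(type(loaded["date"]).__name__, loaded["other_t"].microsecond)
# ... while a bare date comes back as datetime.date.
print(type(loaded["day_only"]).__name__)
```

Because the loader hands back ordinary datetime objects, pd.DataFrame can turn them into a datetime column without further help.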

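import datetime

import yaml
from yaml import CDumper  # needs PyYAML built with libyaml; yaml.Dumper is the pure-Python equivalent
from yaml.representer import SafeRepresenter


class TSDumper(CDumper):
    """Dumper that writes timestamps as plain YAML timestamps."""


def timestamp_representer(dumper, data):
    # Convert the pd.Timestamp to a datetime.datetime and reuse the stock representer.
    return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())


# add_representer matches on the exact type, so pd.Timestamp gets its own entry.
TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
TSDumper.add_representer(pd.Timestamp, timestamp_representer)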
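The reason pd.Timestamp needs its own registration, even though it subclasses datetime.datetime, is that add_representer matches on the exact type of the object. The sketch below shows this with a stand-in subclass so it runs without pandas. The names Stamp, ExactDumper and MultiDumper are hypothetical and only for illustration. It also tries add_multi_representer, which matches subclasses, as a possible alternative to the approach in this answer.

```python
import datetime

import yaml
from yaml.representer import RepresenterError, SafeRepresenter


class Stamp(datetime.datetime):
    """Hypothetical stand-in for pd.Timestamp: a plain datetime subclass."""


class ExactDumper(yaml.SafeDumper):
    """Only has SafeDumper's stock exact-type representers."""


class MultiDumper(yaml.SafeDumper):
    """Additionally matches any subclass of datetime.datetime."""


MultiDumper.add_multi_representer(
    datetime.datetime, SafeRepresenter.represent_datetime
)

value = {"date": Stamp(2020, 5, 12)}

# The exact-type lookup misses Stamp, so the safe dumper refuses the object.
try:
    yaml.dump(value, Dumper=ExactDumper)
    exact_ok = True
except RepresenterError:
    exact_ok = False

# The multi representer walks the MRO and finds datetime.datetime.
multi_text = yaml.dump(value, Dumper=MultiDumper)

print(exact_ok)
print(multi_text)
```

With the full (non-safe) Dumper or CDumper, the same miss does not raise. It falls through to the generic object representer, which is where the !!python/object/apply output in the question comes from.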

With TSDumper in place, we can now do:

text = yaml.dump(
    df.reset_index().to_dict(orient='records'),
    sort_keys=False, width=72, indent=4,
    default_flow_style=None, Dumper=TSDumper,
)
print(text)

The output is fairly clean:

-   date: 2020-05-12 00:00:00
    x: 0
    b: foo
    c: [1, 2, 3, 4]
    other_t: 2020-05-13 02:30:23.422589
-   date: 2020-05-13 00:00:00
    x: 1
    b: bar
    c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
    other_t: 2020-05-13 02:30:23.422613
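The mix of block layout for the records and flow layout for the lists comes from default_flow_style=None. Here is a small sketch comparing the three settings on a made-up record (the variable names are only for illustration):

```python
import yaml

records = [{"x": 0, "c": [1, 2, 3]}]

# False: everything in block style, one list item per line.
block_text = yaml.dump(records, default_flow_style=False, sort_keys=False)

# None: collections that contain only scalars use flow style, the rest block style.
mixed_text = yaml.dump(records, default_flow_style=None, sort_keys=False)

# True: everything in flow style.
flow_text = yaml.dump(records, default_flow_style=True, sort_keys=False)

for label, text in [("False", block_text), ("None", mixed_text), ("True", flow_text)]:
    print(f"default_flow_style={label}")
    print(text)
```

All three variants load back to the same data. The None setting is the one that keeps each record readable while keeping long lists compact.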

Now we can load it back:

df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')

And (drum roll, please):

df2.equals(df)
# True
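Since the original goal was a yaml file, the same steps can be wrapped in two small helpers that write to and read from a stream. This is only a sketch: dump_frame and load_frame are hypothetical names, the data is a reduced version of the example, an in-memory buffer stands in for a real file, and the pure-Python yaml.Dumper is used instead of CDumper so that libyaml is not required.

```python
import io

import pandas as pd
import yaml
from yaml.representer import SafeRepresenter


class TSDumper(yaml.Dumper):
    """Pure-Python variant of the dumper above (assumption: no libyaml needed)."""


TSDumper.add_representer(
    pd.Timestamp,
    lambda dumper, data: SafeRepresenter.represent_datetime(
        dumper, data.to_pydatetime()
    ),
)


def dump_frame(df, stream):
    """Write df (including its index) to stream as a list of records."""
    records = df.reset_index().to_dict(orient="records")
    yaml.dump(
        records, stream, sort_keys=False, width=72, indent=4,
        default_flow_style=None, Dumper=TSDumper,
    )


def load_frame(stream, index):
    """Read records from stream and restore the named index column."""
    return pd.DataFrame(yaml.safe_load(stream)).set_index(index)


df = pd.DataFrame([
    dict(date=pd.Timestamp("2020-05-12"), x=0, b="foo", c=[1, 2, 3, 4]),
    dict(date=pd.Timestamp("2020-05-13"), x=1, b="bar", c=[5, 6]),
]).set_index("date")

buffer = io.StringIO()  # stands in for open("frame.yaml", "w")
dump_frame(df, buffer)
buffer.seek(0)
df2 = load_frame(buffer, "date")
print(df2)
```

With a real file, the StringIO would be replaced by open(path, "w") for dumping and open(path) for loading. Whether df2.equals(df) holds may depend on the column dtypes pandas infers on reload, so it is worth checking on your own data.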
