How to get an exact representation of floats during `DataFrame.to_json`?

Question

I observed the following behavior of DataFrame.to_json:

>>> df = pd.DataFrame([[eval(f'1.12345e-{i}') for i in range(8, 20)]])
>>> df
             0             1             2             3             4             5             6             7             8             9             10            11
0  1.123450e-08  1.123450e-09  1.123450e-10  1.123450e-11  1.123450e-12  1.123450e-13  1.123450e-14  1.123450e-15  1.123450e-16  1.123450e-17  1.123450e-18  1.123450e-19
>>> print(df.to_json(indent=2, orient='index'))
{
  "0":{
    "0":0.0000000112,
    "1":0.0000000011,
    "2":0.0000000001,
    "3":0.0,
    "4":0.0,
    "5":0.0,
    "6":0.0,
    "7":0.0,
    "8":1.12345e-16,
    "9":1.12345e-17,
    "10":1.12345e-18,
    "11":1.12345e-19
  }
}

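So all numbers down to 1e-16 appear to be rounded to 10 decimal places (consistent with the default value of double_precision), but all smaller values are represented exactly. Why is that, and how can I turn off the decimal rounding for the larger values (i.e. use scientific notation instead)?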

>>> pd.__version__
'1.3.1'

For reference, the standard library's json module doesn't do this:

>>> import json
>>> print(json.dumps([eval(f'1.12345e-{i}') for i in range(8, 20)], indent=2))
[
  1.12345e-08,
  1.12345e-09,
  1.12345e-10,
  1.12345e-11,
  1.12345e-12,
  1.12345e-13,
  1.12345e-14,
  1.12345e-15,
  1.12345e-16,
  1.12345e-17,
  1.12345e-18,
  1.12345e-19
]

Answer 1

Looking at /pandas/io/json/_json.py in the codebase, double_precision defaults to 10. See the signature below:

def to_json(
    path_or_buf,
    obj,
    orient: Optional[str] = None,
    date_format: str = "epoch",
    double_precision: int = 10,
    force_ascii: bool = True,
    date_unit: str = "ms",
    default_handler: Optional[Callable[[Any], JSONSerializable]] = None,
    lines: bool = False,
    compression: Optional[str] = "infer",
    index: bool = True,
    indent: int = 0,

If you apply the maximum precision (15), you get the following:

>>> print(df.to_json(indent=2, orient='records', double_precision=15))
[
  {
    "0":0.0000000112345,
    "1":0.00000000112345,
    "2":0.000000000112345,
    "3":0.000000000011234,
    "4":0.000000000001123,
    "5":0.000000000000112,
    "6":0.000000000000011,
    "7":0.000000000000001,
    "8":1.12345e-16,
    "9":1.12345e-17,
    "10":1.12345e-18,
    "11":1.12345e-19,
    "12":1.12345e-20,
    "13":1.12345e-21,
    "14":1.12345e-22,
    "15":1.12345e-23,
    "16":1.12345e-24,
    "17":1.12345e-25,
    "18":1.12345e-26,
    "19":1.12345e-27,
    "20":1.12345e-28,
    "21":1.12345e-29,
    "22":1.12345e-30,
    "23":1.12345e-31,
    "24":1.12345e-32,
    "25":1.12345e-33,
    "26":1.12345e-34,
    "27":1.12345e-35,
    "28":1.12345e-36,
    "29":1.12345e-37,
    "30":1.12345e-38,
    "31":1.12345e-39
  }
]

Note: if you use a precision above 15, you get a ValueError:

ValueError: Invalid value '20' for option 'double_precision', max is '15'

So, in that sense, this differs from json.dumps.

Answer 2

I'm not sure whether this is possible with pd.DataFrame.to_json, but we can use pd.DataFrame.to_dict, json, and pd.read_json to get a full-precision JSON representation of a pandas DataFrame.

>>> json_df = json.dumps(df.to_dict('index'), indent=2)
>>> print(json_df)
{
  "0": {
    "0": 1.12345e-08,
    "1": 1.12345e-09,
    "2": 1.12345e-10,
    "3": 1.12345e-11,
    "4": 1.12345e-12,
    "5": 1.12345e-13,
    "6": 1.12345e-14,
    "7": 1.12345e-15,
    "8": 1.12345e-16,
    "9": 1.12345e-17,
    "10": 1.12345e-18,
    "11": 1.12345e-19
  }
}

To read it back, we can do:

>>> pd.read_json(json_df, orient='index')
             0             1             2   ...            9             10            11
0  1.123450e-08  1.123450e-09  1.123450e-10  ...  1.123450e-17  1.123450e-18  1.123450e-19

[1 rows x 12 columns]

Answer 3

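First of all, you can't get an "exact floating-point representation" of these numbers, because they have no exact binary representation. For example, decimal 1.2345e-8 can be written exactly with 5 decimal digits (plus an exponent), but converted to binary it is a repeating fraction, so it cannot be represented exactly with a finite number of binary digits. Some rounding error is therefore unavoidable. You can see this by printing to very high precision:

>>> for i in range(8, 20):
...     print(f'{float(f"1.12345e-{i}"):.17e}')
...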
1.12345000000000004e-08
1.12345000000000009e-09
1.12345000000000001e-10
1.12344999999999994e-11
1.12345000000000002e-12
1.12345000000000000e-13
1.12345000000000006e-14
1.12344999999999994e-15
1.12345000000000004e-16
1.12344999999999998e-17
1.12344999999999998e-18
1.12345000000000008e-19
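
As a small sketch using only the standard library, `decimal.Decimal` can show the exact value a double stores, and `repr` shows why json.dumps still prints the short form. The variable names here are illustrative only.

```python
from decimal import Decimal

x = 1.12345e-08

# Decimal(float) converts the stored binary double exactly, with no rounding,
# so it reveals the value that actually lives in memory.
stored = Decimal(x)
written = Decimal("1.12345e-08")  # the decimal literal as typed

print(stored)             # a long digit string close to, but not equal to, 1.12345e-08
print(stored == written)  # False: the literal has no exact binary representation

# repr() yields the shortest string that parses back to the same double,
# which is why json.dumps can print the short form without losing information.
print(repr(x))              # 1.12345e-08
print(float(repr(x)) == x)  # True
```

So "exact" in practice means "round-trips to the same double", not "equals the decimal literal".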

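That aside, I think your problem is the pandas to_json implementation. What seems to be happening is that the precision (specified by double_precision) is a fixed number of decimal places (not significant digits), and it is ignored if the value is below 1e-15.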
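
The difference between decimal places and significant digits can be illustrated with Python's own format specifiers. This is only an analogy for the rounding seen in the question with the default double_precision=10, not pandas' internal code.

```python
values = [1.12345e-08, 1.12345e-11]

for v in values:
    fixed = f"{v:.10f}"        # 10 decimal places, like the rounded output above
    significant = f"{v:.10g}"  # up to 10 significant digits
    print(fixed, significant)

# 0.0000000112 1.12345e-08
# 0.0000000000 1.12345e-11
```

With a fixed number of decimal places, the second value collapses to zero even though it is far from the limits of double precision.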

I would argue this is a bug, because any number between 1e-10 (for the default value of double_precision) and 1e-15 is lost completely: you just get zero.

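Even with double_precision=15, the serialized values become increasingly inaccurate as you approach the limit, until suddenly they become accurate again. So I would suggest that a fixed number of decimal places that may or may not be ignored is problematic; there should at least be an option to serialize to a fixed number of significant digits, and perhaps the default should even match the json module.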
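
Until such an option exists, rounding to significant digits can be approximated in user code before serializing. The sketch below assumes a hypothetical helper, `round_sig`, which is not part of pandas or the json module.

```python
import json


def round_sig(x: float, digits: int) -> float:
    """Round x to `digits` significant digits (hypothetical helper, digits >= 1)."""
    # The 'e' format keeps one digit before the decimal point, so a precision
    # of digits - 1 yields `digits` significant digits in total.
    return float(f"{x:.{digits - 1}e}")


values = [1.12345e-08, 1.12345e-11, 1.12345e-16]
print(json.dumps([round_sig(v, 3) for v in values]))
# [1.12e-08, 1.12e-11, 1.12e-16]
```

Every value keeps the same relative precision regardless of magnitude, which is the behavior a fixed number of decimal places cannot provide.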

I would raise both of these issues with the pandas developers.

As for a quick fix, take @maneblusser's suggestion: use to_dict first, then json.dumps.

Answer 4

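pd.DataFrame.to_json uses the internal library pandas._libs.json rather than the standard json module, which explains the difference in behavior. The former "normalizes" numbers internally and doesn't expose an API to control this. So you have the following options:

Convert to a dictionary and dump it with the standard json library (as suggested earlier):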

>>> print(json.dumps(df.to_dict(orient='records'), indent=2))

[
  {
    "0": 1.12345e-08,
    "1": 1.12345e-09,
    "2": 1.12345e-10,
    "3": 1.12345e-11,
    "4": 1.12345e-12,
    "5": 1.12345e-13,
    "6": 1.12345e-14,
    "7": 1.12345e-15,
    "8": 1.12345e-16,
    "9": 1.12345e-17,
    "10": 1.12345e-18,
    "11": 1.12345e-19
  }
]

This is a perfectly legitimate solution.

You can use CSV instead of JSON and specify the desired float format:
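
A minimal round-trip check for this approach might look as follows. It parses with the standard json module rather than pd.read_json, since read_json's default decoder trades some precision for speed unless precise_float=True is passed. The variable names are illustrative.

```python
import json

import pandas as pd

df = pd.DataFrame([[float(f"1.12345e-{i}") for i in range(8, 20)]])

# to_dict yields plain Python floats; json.dumps writes each one with repr(),
# i.e. the shortest string that parses back to the same double.
text = json.dumps(df.to_dict(orient="records"))

# Rebuild the frame with the standard-library parser. JSON object keys are
# always strings, so the column labels are cast back to integers here.
restored = pd.DataFrame(json.loads(text))
restored.columns = restored.columns.astype("int64")

# Compare the raw values element by element: every double survives unchanged.
print(restored.to_numpy().tolist() == df.to_numpy().tolist())  # True
```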

>>> print(df.to_csv(float_format='%.10e', index=False))

0,1,2,3,4,5,6,7,8,9,10,11
1.1234500000e-08,1.1234500000e-09,1.1234500000e-10,1.1234500000e-11,1.1234500000e-12,1.1234500000e-13,1.1234500000e-14,1.1234500000e-15,1.1234500000e-16,1.1234500000e-17,1.1234500000e-18,1.1234500000e-19

Another option is to convert the values to strings before the "normalization" kicks in:

>>> print(df.astype(str).to_json(indent=2, orient='index'))

{
  "0":{
    "0":"1.12345e-08",
    "1":"1.12345e-09",
    "2":"1.12345e-10",
    "3":"1.12345e-11",
    "4":"1.12345e-12",
    "5":"1.12345e-13",
    "6":"1.12345e-14",
    "7":"1.12345e-15",
    "8":"1.12345e-16",
    "9":"1.12345e-17",
    "10":"1.12345e-18",
    "11":"1.12345e-19"
  }
}

Converting the strings back needs special care when reading the JSON.
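
One way to handle that conversion, sketched with the standard json module and a shortened sample of the output above (the two-column `text` literal is made up for illustration):

```python
import json

# A shortened sample in the same shape as the astype(str) output.
text = '{"0": {"0": "1.12345e-08", "1": "1.12345e-19"}}'

parsed = json.loads(text)

# Every value arrives as a string, so convert each one back explicitly.
restored = {
    row: {col: float(s) for col, s in cols.items()}
    for row, cols in parsed.items()
}
print(restored)
# {'0': {'0': 1.12345e-08, '1': 1.12345e-19}}
```

The row and column keys stay strings as well, so they may need their own conversion depending on the original index types.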

Finally, if you need exact values, just use a binary format such as parquet or pickle.
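
As a rough sketch of the binary route, an in-memory pickle round trip preserves every double bit for bit. Writing to a file with DataFrame.to_pickle and reading it with pd.read_pickle should behave the same way; parquet additionally needs an engine such as pyarrow installed.

```python
import pickle

import pandas as pd

df = pd.DataFrame([[float(f"1.12345e-{i}") for i in range(8, 20)]])

# Pickle stores the underlying float64 buffer, so no decimal conversion occurs.
payload = pickle.dumps(df)
restored = pickle.loads(payload)

print(restored.equals(df))  # True
```

Keep in mind that pickle files should only be loaded from trusted sources.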


python

floating-point

json

pandas