Unable to parse a column of JSON strings in a modin dataframe (works in pandas)

I have a dataframe containing JSON strings that I want to convert into JSON objects. df.col.apply(json.loads) works in pandas, but fails when used on a modin dataframe.

Example:

import pandas
import modin.pandas
import json

pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)

0    {}
Name: a, dtype: object


modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)

TypeError: the JSON object must be str, bytes or bytearray, not float

This issue was also raised on GitHub and answered here: https://github.com/modin-project/modin/issues/616

The error is coming from the error checking component of the run, where we call the apply (or agg) on an empty DataFrame to determine the return type and let pandas handle the error checking.

Locally, I can reproduce this issue and have fixed it by changing the line to perform the operation on one line of the Series. This may affect the performance, so I need to do some more tuning to see if there is a way to speed it up and still be robust. After the fix the overhead of that check is ~10ms for 256 columns and I don't think we want error checking to take that long.
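
The "not float" part of the message suggests that json.loads was handed a float rather than one of the strings in the column. As a minimal sketch using only the standard library (no modin involved, and the helper name describe_loads_failure is made up for illustration), the same message can be reproduced by passing a float directly:

```python
import json


def describe_loads_failure(value):
    """Return the TypeError message json.loads raises for value, or None on success."""
    try:
        json.loads(value)
    except TypeError as exc:
        return str(exc)
    return None


# A float (here NaN) is not a valid input type for json.loads.
nan_message = describe_loads_failure(float("nan"))
print(nan_message)

# A real JSON string parses without error.
print(describe_loads_failure("{}"))
```

This only shows where the message text comes from; it does not reproduce modin's internal check.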

Until the fix is released, you can work around this by using code that also works on empty data, for example:

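def safe_loads(x):
    # Return None for values json.loads cannot decode, such as the
    # float reported in the error above.
    try:
        return json.loads(x)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return None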

modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(safe_loads)
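
One possible variant, if malformed JSON should still raise instead of silently becoming None, is to check the input type first and only decode string-like values. The sketch below is checked against plain Python values rather than modin, and the name loads_if_str is hypothetical:

```python
import json


def loads_if_str(x):
    """Decode x with json.loads only if it is a type json.loads accepts."""
    if isinstance(x, (str, bytes, bytearray)):
        return json.loads(x)  # malformed JSON still raises json.JSONDecodeError
    return None  # floats, None, and other non-string values are skipped


values = ['{"k": 1}', '{}', float("nan"), None]
parsed = [loads_if_str(v) for v in values]
print(parsed)  # [{'k': 1}, {}, None, None]
```

It would be passed to apply the same way as safe_loads; whether it avoids the error on a given modin version is an assumption that has not been verified here.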