Loading a very large jsonl in pandas returns ValueError

I am trying to load a very large jsonl file (>50 GB) in chunks using pandas:

import pandas as pd

reader = pd.read_json("January.jsonl", lines=True, chunksize=10000)

for chunk in reader:
    df = chunk

The code starts, runs for a while, and then returns this error:

 self._parse_no_numpy()

  File "C:\Users\anaconda3\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None

ValueError: Expected object or value

Is something wrong with my file, or is it something else?

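The JSON data in your file appears to be malformed. For example, try loading the following "JSON" data; note that the line with id 77 is malformed (a stray "i" precedes the first key).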

{"created_at": "2019-01-01 23:45:01", "id":1}
{"created_at": "2019-01-01 23:45:01", "id":2}
{"created_at": "2019-01-01 23:45:01", "id":3}
{"created_at": "2019-01-01 23:45:01", "id":4}
{"created_at": "2019-01-01 23:45:01", "id":5}
{"created_at": "2019-01-01 23:45:01", "id":6}
{"created_at": "2019-01-01 23:45:01", "id":7}
{"created_at": "2019-01-01 23:45:01", "id":8}
{"created_at": "2019-01-01 23:45:01", "id":11}
{"created_at": "2019-01-01 23:45:01", "id":22}
{"created_at": "2019-01-01 23:45:01", "id":33}
{"created_at": "2019-01-01 23:45:01", "id":44}
{"created_at": "2019-01-01 23:45:01", "id":55}
{"created_at": "2019-01-01 23:45:01", "id":66}
{i"created_at": "2019-01-01 23:45:01", "id":77}

{"created_at": "2019-01-01 23:45:01", "id":88}
{"created_at": "2019-01-01 23:45:01", "id":99}

Then run this code:

>>> import pandas as pd
>>> reader = pd.read_json("January.jsonl", lines=True, chunksize=1)
>>> for r in reader:
...     print(r)

and look at the output:

12 2019-01-01 23:45:01  55
            created_at  id
13 2019-01-01 23:45:01  66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 779, in __next__
    obj = self._get_object_parser(lines_json)
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

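This is the same error you received. You will need to find the malformed data and fix it. You can read the JSON data line by line to find where the errors are and print the offending lines so you can inspect them.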

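import json

# Iterate over the file lazily; readlines() would load all 50+ GB into memory.
with open("January.jsonl") as f:
    for line_no, line in enumerate(f):
        try:
            json.loads(line)
        except ValueError:
            # Report the zero-based line number and the offending line.
            print(line_no)
            print(line)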
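
If there are too many bad lines to fix by hand, one option is to copy only the parseable lines into a second file and load that file instead. The following is a minimal sketch using only the standard library. The function name split_jsonl and the UTF-8 default encoding are assumptions for illustration, not anything pandas provides.

```python
import json


def split_jsonl(src_path, clean_path, encoding="utf-8"):
    """Copy the parseable lines of src_path into clean_path.

    Returns a list of (line_number, error_message) tuples, one per
    rejected line. Line numbers are 1-based. Blank lines are dropped
    without being reported.
    """
    bad = []
    with open(src_path, encoding=encoding) as src, \
            open(clean_path, "w", encoding=encoding) as dst:
        # Stream the file one line at a time so memory use stays small.
        for line_no, line in enumerate(src, start=1):
            if not line.strip():
                continue  # blank line: nothing to parse
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                # exc.msg and exc.colno say what went wrong and where.
                bad.append((line_no, f"{exc.msg} (column {exc.colno})"))
            else:
                # Keep the line as-is, making sure it ends with a newline.
                dst.write(line if line.endswith("\n") else line + "\n")
    return bad
```

Calling split_jsonl("January.jsonl", "January.clean.jsonl") (the second filename is hypothetical) returns the rejected line numbers with the reason for each, and the cleaned file can then be passed to pd.read_json with lines=True and chunksize as before. Note that pandas parses JSON with its own parser rather than the standard json module, so the two may disagree on rare edge cases. Treat the cleaned file as a likely fix, not a guaranteed one.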