Loading a very large jsonl in pandas returns ValueError
I am trying to load a very large jsonl file (>50 GB) in chunks using pandas:
import pandas as pd

reader = pd.read_json("January.jsonl", lines=True, chunksize=10000)
for chunk in reader:
    df = chunk
This code starts, runs for a while, and then returns this error:
    self._parse_no_numpy()
  File "C:\Users\anaconda3\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
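Is something wrong with my file, or is it something else?
The JSON data in your file appears to be malformed. For example, try loading the following "JSON" data; note that the record with id 77 is malformed.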
{"created_at": "2019-01-01 23:45:01", "id":1}
{"created_at": "2019-01-01 23:45:01", "id":2}
{"created_at": "2019-01-01 23:45:01", "id":3}
{"created_at": "2019-01-01 23:45:01", "id":4}
{"created_at": "2019-01-01 23:45:01", "id":5}
{"created_at": "2019-01-01 23:45:01", "id":6}
{"created_at": "2019-01-01 23:45:01", "id":7}
{"created_at": "2019-01-01 23:45:01", "id":8}
{"created_at": "2019-01-01 23:45:01", "id":11}
{"created_at": "2019-01-01 23:45:01", "id":22}
{"created_at": "2019-01-01 23:45:01", "id":33}
{"created_at": "2019-01-01 23:45:01", "id":44}
{"created_at": "2019-01-01 23:45:01", "id":55}
{"created_at": "2019-01-01 23:45:01", "id":66}
{i"created_at": "2019-01-01 23:45:01", "id":77}
{"created_at": "2019-01-01 23:45:01", "id":88}
{"created_at": "2019-01-01 23:45:01", "id":99}
Then run this code.
>>> import pandas as pd
>>> reader = pd.read_json("January.jsonl", lines=True, chunksize=1)
>>> for r in reader:
...     print(r)
And look at the output:
12 2019-01-01 23:45:01  55
            created_at  id
13 2019-01-01 23:45:01  66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 779, in __next__
    obj = self._get_object_parser(lines_json)
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
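This is the same error you received. You will need to find the malformed data and fix it. You can try reading the JSON data line by line to find where the errors are and print the offending lines so you can inspect them.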
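import json

# Stream the file line by line; readlines() would pull all 50+ GB into memory.
# encoding="utf-8" is an assumption about the file; adjust it if needed.
with open("January.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            # Print the 1-based line number and the line that failed to parse.
            print(line_no)
            print(line)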
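If fixing the file by hand is not practical, one option is to skip the bad lines while loading. The following is only a rough sketch, not something pandas provides: `iter_valid_batches` is a hypothetical helper name, and the in-memory `sample` stands in for the real file. It parses each line with `json.loads`, records the line numbers that fail, and yields the good records in batches.

```python
import io
import json
from typing import Iterator, List, TextIO, Tuple


def iter_valid_batches(
    handle: TextIO, batch_size: int = 10000
) -> Iterator[Tuple[List[dict], List[int]]]:
    """Yield (records, bad_line_numbers) per batch, skipping malformed lines.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    records: List[dict] = []
    bad: List[int] = []
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line_no)  # remember where the malformed record was
            continue
        if len(records) >= batch_size:
            yield records, bad
            records, bad = [], []
    # Emit whatever is left after the last full batch.
    if records or bad:
        yield records, bad


# Small in-memory stand-in for the file, reusing three lines from the example.
sample = io.StringIO(
    '{"created_at": "2019-01-01 23:45:01", "id":66}\n'
    '{i"created_at": "2019-01-01 23:45:01", "id":77}\n'
    '{"created_at": "2019-01-01 23:45:01", "id":88}\n'
)
batches = list(iter_valid_batches(sample, batch_size=2))
```

With the real file you would pass the handle from `open("January.jsonl")` instead of `sample`, and build each chunk with `pd.DataFrame(records)`. Note that, unlike `pd.read_json`, this does not convert `created_at` to a datetime for you, so you may need `pd.to_datetime` afterwards.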