When extracting my .json.gz file, some characters are added to it, and the file cannot be stored as a JSON file

Question

I am trying to decompress some .json.gz files, but gzip adds some characters to them, so the json module cannot read the result.

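What do you think the problem is, and how can I fix it?

If I decompress the files with extraction software such as 7zip, the problem does not occur.

This is my code: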

with gzip.open('filename' , 'rb') as f:
    json_content = json.loads(f.read())

This is the error I get:

Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)

I then ran this code:

with gzip.open ('filename', mode='rb') as f:
    print(f.read())

and realized that the output starts with b' (as shown below):

b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"

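I think the b' is the reason the file cannot be used in the next stage. Do you have any solution for removing the b'? There are millions of these compressed files, and I cannot do it manually.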

I have uploaded a sample of these files (just a few json.gz files) at this link.

Answer 1

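The problem isn't the b prefix you see in the output of print(f.read()). That prefix only means the data is a bytes object (a sequence of raw byte values) rather than a str (a regular Python string of Unicode characters), and json.loads() accepts either. The JSONDecodeError occurs because the data in the compressed file is not a single valid JSON document, which is what json.loads() requires. The format looks like what is known as JSON Lines, which the Python standard library json module does not (directly) support.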
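To illustrate both points, here is a minimal sketch using a made-up two-record payload (the data value is an assumption for demonstration, not taken from the sample files). It shows that json.loads() parses bytes directly, and that feeding it two newline-separated objects at once produces the same kind of "Extra data" error reported in the question.

```python
import json

# Hypothetical payload in JSON Lines form: one JSON object per line.
data = b'{"id": 1}\n{"id": 2}\n'

# bytes input is fine: a single line parses without any manual decoding.
first = json.loads(data.splitlines()[0])

# Parsing the whole payload in one call fails, because a second
# document follows the first one.
try:
    json.loads(data)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# Parsing each non-blank line separately succeeds.
records = [json.loads(line) for line in data.splitlines() if line.strip()]
```

The error position (line 2, column 1) points at the start of the second object, which matches the pattern of the traceback in the question.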

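Dunes' answer to the question that @Charles Duffy at one point marked this one as a duplicate of would not have worked as presented because of this format issue. However, judging from the sample files you linked in your question, there appears to be one valid JSON object on each line of a file. If that is true of all your files, a simple workaround is to process each file line by line.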

Here's what I mean:

import json
import gzip


filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename , 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.

Note that the output it prints shows what valid JSON for this data could look like.
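Since there are millions of files, it may be preferable not to hold everything in one list. The following is a rough sketch of the same line-by-line idea written as a generator; the function name iter_json_lines, the UTF-8 encoding, and the throwaway sample file are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a gzipped JSON Lines file."""
    # 'rt' decodes to str while reading; UTF-8 is an assumption about the data.
    with gzip.open(path, 'rt', encoding='utf-8') as gzip_file:
        for line in gzip_file:
            line = line.strip()
            if line:  # Skip blank lines.
                yield json.loads(line)


# Demo on a temporary file containing made-up records.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json.gz')
    with gzip.open(sample_path, 'wt', encoding='utf-8') as out:
        out.write('{"id": 1}\n\n{"id": 2}\n')
    ids = [record['id'] for record in iter_json_lines(sample_path)]
```

Because the generator yields records one at a time, each file can be processed without loading all of its parsed objects into memory first.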

python

json

gzip

jsonlines