How to fix a possibly corrupted JSON file? Problems with a curly bracket character "{" (Python 3)
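This problem is so strange that I don't even know how to ask it, but I'll try. I have some JSON files containing web-scraped data, with multiple entries per file. They look like this: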
{
"doc_id": "some_number",
"url": "www.seedurl1.com",
"scrape_date": "2019-10-22 16:17:22",
"publish_date": "unknown",
"author": "unknown",
"urls_out": [
"https://www.something.com",
"https://www.sometingelse.com/smth"
],
"text": "lots of text here"
}
{
"doc_id": "some_other_number",
"url": "www.seedurl2.com/smth",
"scrape_date": "2019-10-22 17:44:40",
"publish_date": "unknown",
"author": "unknown",
"urls_out": [
"www.anotherurl.com/smth",
"http://urlx.com/smth.htm"
],
"text": "lots more text over here."
}
I am trying to reformat them so that each entry sits on its own line, like this:
{"doc_id": blah blah....}
{"doc_id": blah blah blah...}
So I did this:
import codecs
import re

# Read the file (`file` holds the path to the input .json file)
with codecs.open(file, 'r', encoding='utf-8-sig', errors='replace') as f:
    text = f.read()
# Check if }{ was found;
# this prints nothing for original files but finds everything in a hand written file
pattern = '}{'
print('Before editing: ', (re.findall(pattern, text)))
# Getting rid of excess newlines and whitespaces
newtext = " ".join(text.split())
# Check if } { was found;
# this prints nothing for original files but finds everything in a hand written file
pattern = '} {'
print('After editing: ', (re.findall(pattern, newtext)))
# Put newlines in the right places
finaltext = re.sub('} {', '}\n{', newtext)
# Write the new JSON
newfile = file[:-5] + '_ED.json'  # strip the '.json' suffix, then append the new one
with codecs.open(newfile, 'w', encoding='utf-8', errors='replace') as nf:
    nf.write(finaltext)
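The thing is, the code works perfectly on a hand-written test file with the same structure, but not on the original files or on smaller test files derived from them.
I tried a plain search in a text editor for "}" and "{" separately, and both are found. But if I search for "}{" or "} {", nothing is found, even though I can see that they are clearly there.
One last finding: I opened the edited version of my small test file with Nano on Linux and moved to the problem area. For some reason, I have to press the right arrow key twice to move onto the "{" brace. So there is clearly something odd sitting there. How do I find out what it is? Or does anyone have other suggestions that might help?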
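One way to find out what sits between the braces is to print the code point and Unicode name of every character between a "}" and the next "{". The following is a minimal sketch, not a definitive diagnosis: the helper name `describe_gaps`, the `max_gap` limit and the sample string are hypothetical, and the byte order mark (U+FEFF) in the sample is only an assumption about what such a file might contain. `str.split()` does not treat characters like that as whitespace, which would be consistent with the behaviour described above.

```python
import re
import unicodedata

def describe_gaps(text, max_gap=10):
    """List the characters found between each '}' and the following '{'."""
    # Capture up to max_gap characters that are neither '{' nor '}'.
    pattern = r'\}([^{}]{0,%d})\{' % max_gap
    gaps = []
    for match in re.finditer(pattern, text):
        gap = match.group(1)
        # Control characters have no Unicode name, so fall back to 'UNNAMED'.
        gaps.append([('U+%04X' % ord(ch), unicodedata.name(ch, 'UNNAMED'))
                     for ch in gap])
    return gaps

# Hypothetical sample: a newline plus an invisible BOM between two objects.
sample = '{"a": 1}\n\ufeff{"b": 2}'
for gap in describe_gaps(sample):
    print(gap)
```

Running the same function on the text read from one of the original files should show whether an invisible character sits between the objects.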
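The simplest solution would be to just create a JSON array, to...
Otherwise, I would suggest not replacing anything and simply counting matching brackets.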
count = 0    # current brace nesting depth
objects = 0  # number of complete top-level objects seen
with open('file.txt') as f:
    for i, c in enumerate(f.read()):
        if c == '\n':
            continue
        elif c == '{':
            if i > 0 and count == 0:
                print()  # start new line before printing bracket
            count += 1
        elif c == '}':
            count -= 1
            if count == 0:  # found a complete JSON object
                objects += 1
        print(c, end='')
print(f'\n\nfound {objects} objects')  # for debugging
For the given text, I end up with this:
{"doc_id": "some_number","url": "www.seedurl1.com","scrape_date": "2019-10-22 16:17:22","publish_date": "unknown","author": "unknown","urls_out": ["https://www.something.com","https://www.sometingelse.com/smth"],"text": "lots of text here"}
{"doc_id": "some_other_number","url": "www.seedurl2.com/smth","scrape_date": "2019-10-22 17:44:40","publish_date": "unknown","author": "unknown","urls_out": ["www.anotherurl.com/smth","http://urlx.com/smth.htm"],"text": "lots more text over here."}
found 2 objects
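The counting loop above also counts braces that occur inside string values, for example in the scraped "text" field. A different technique that avoids this is to let the standard library parser find the end of each object with `json.JSONDecoder.raw_decode`. This is a minimal sketch under stated assumptions: the helper name `iter_json_objects` and the sample string are hypothetical, every top-level value is assumed to be an object, and anything between objects that is not "{" is silently skipped.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whatever separates the objects (whitespace, invisible characters).
        if text[pos] != '{':
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: braces inside a string and a stray BOM between objects.
sample = '{"a": "has } and { inside"}\n\ufeff{"b": [1, 2]}'
objects = list(iter_json_objects(sample))
for obj in objects:
    print(obj)
```

Because the parser decides where each object ends, malformed input raises `json.JSONDecodeError` instead of producing a wrong split.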
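Here is one approach.
For example:
import json

with open(filename) as infile:
    # Turn the concatenated objects into one JSON array by adding commas and brackets
    data = json.loads("[" + infile.read().replace("}\n{", "},\n{") + "]")
for i in data:
    print(i)
Output: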
{'doc_id': 'some_number', 'url': 'www.seedurl1.com',.....
{'doc_id': 'some_other_number', 'url': 'www.seedurl2.com/smth',.....
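Once the entries are parsed into a list, the one-entry-per-line layout the question asks for can be produced by serializing each entry with `json.dumps`, which emits no newlines unless `indent` is set. A minimal sketch, where the helper name `to_json_lines` and the shortened sample records are hypothetical:

```python
import json

def to_json_lines(records):
    """Serialize each record as JSON on its own line."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes.
    return "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records)

# Hypothetical, shortened records in the shape used in the question.
data = [
    {"doc_id": "some_number", "urls_out": ["https://www.something.com"]},
    {"doc_id": "some_other_number", "urls_out": []},
]
output = to_json_lines(data)
print(output)
```

The resulting string can then be written to the new file with an ordinary `open(..., 'w', encoding='utf-8')`.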
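Here is another solution that is close to the one you tried:
import json

with open('test.txt') as f:
    file = f.readlines()
# Drop the newlines, strip the outermost braces, split on "}{" and re-wrap each piece
file = ['{' + i + '}' for i in "".join("".join(file).split("\n"))[1:-1].split("}{")]
for i in file:
    print(json.loads(i))
json is used here only to validate the individual JSON objects. This gives: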
{'doc_id': 'some_number', 'url': 'www.seedurl1.com', 'scrape_date': '2019-10-22 16:17:22', 'publish_date': 'unknown', 'author': 'unknown', 'urls_out': ['https://www.something.com', 'https://www.sometingelse.com/smth'], 'text': 'lots of text here'}
{'doc_id': 'some_other_number', 'url': 'www.seedurl2.com/smth', 'scrape_date': '2019-10-22 17:44:40', 'publish_date': 'unknown', 'author': 'unknown', 'urls_out': ['www.anotherurl.com/smth', 'http://urlx.com/smth.htm'], 'text': 'lots more text over here.'}