jq or python script to delete text after date in json field
I have a json file with hundreds of entries, for example:
{
"url":"http://example.com/10618/",
"metatag.eprints.publication":"Journal of Corporate Real Estate",
"metatag.eprints.title":"Corporate Real Estate Strategy",
"metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
"url":"http://example.com/23552/",
"metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
"metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
"metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}
Can anyone help write a jq or python script that, for each block, changes "metatag.eprints.citation" so that all text after the date is removed?
So the blocks above would become:
{
"url":"http://example.com/10618/",
"metatag.eprints.publication":"Journal of Corporate Real Estate",
"metatag.eprints.title":"Corporate Real Estate Strategy",
"metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006)"}
{
"url":"http://example.com/23552/",
"metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
"metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
"metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)"}
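If your file is formatted as in your question, you can use itertools.groupby to group the lines by the opening brace, join each group with str.join, and call json.loads to get a dict. After that it is just a matter of accessing the key and writing the updated data to a temporary file. Finally, use shutil.move to replace the original file. If you want a brand-new file instead, change NamedTemporaryFile to a plain open: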
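from tempfile import NamedTemporaryFile
from shutil import move
from itertools import groupby
import json

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    # Lines that start with "{" form one group; the lines after them form the next.
    for k, v in groupby(f, key=lambda x: x.lstrip().startswith("{")):
        if not k:
            # Put back the opening brace that went into the other group.
            d = json.loads("{" + "".join(v))
            v = d["metatag.eprints.citation"]
            # Keep everything up to and including the first ")".
            d["metatag.eprints.citation"] = v[:v.find(")")+1]
            json.dump(d, out)
            out.write("\n")
# Replace the original file with the updated copy.
move(out.name, "in.txt")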
in.txt before:
{
"url":"http://example.com/10618/",
"metatag.eprints.publication":"Journal of Corporate Real Estate",
"metatag.eprints.title":"Corporate Real Estate Strategy",
"metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
"url":"http://example.com/23552/",
"metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
"metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
"metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}
in.txt after:
{"url": "http://example.com/10618/", "metatag.eprints.publication": "Journal of Corporate Real Estate", "metatag.eprints.citation": "Adair, P, McGrogan, WS, and Webb, JR (2006)", "metatag.eprints.title": "Corporate Real Estate Strategy"}
{"url": "http://example.com/23552/", "metatag.eprints.publication": "European Journal of Cardio-Thoracic Surgery", "metatag.eprints.citation": "Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)", "metatag.eprints.title": "Long-term survival from coronary endarterectomies in coronary artery disease"}
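If you need to edit it again later, you can simply iterate over the file, call json.loads on each line to get a dict, update it by key again, and write it back out. Having one object per line will make your life easier.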
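As a rough sketch of that later edit, assuming the file already holds one object per line: update_json_lines is a hypothetical helper name, and str.upper is only a stand-in for whatever change is actually needed. In-memory streams are used here so the example runs without touching in.txt.

```python
import io
import json


def update_json_lines(src, dst, key, transform):
    """Read one JSON object per line from src, apply transform to d[key], write to dst."""
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        d = json.loads(line)
        if key in d:
            d[key] = transform(d[key])
        json.dump(d, dst)
        dst.write("\n")


# In-memory stand-ins for the real input and output files.
src = io.StringIO(
    '{"url": "http://example.com/10618/", '
    '"metatag.eprints.citation": "Adair, P, McGrogan, WS, and Webb, JR (2006)"}\n'
)
dst = io.StringIO()
update_json_lines(src, dst, "metatag.eprints.citation", str.upper)
print(dst.getvalue())
```

With real files, src and dst would be the open input file and the NamedTemporaryFile from the code above.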
If there can be parentheses before the date, you can use a regular expression to search for the specific substring, four digits between parentheses:
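import re
# Matches a four-digit year in parentheses, e.g. "(2006)".
r = re.compile(r"\(\d{4}\)")

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for k, v in groupby(f, key=lambda x: x.lstrip().startswith("{")):
        if not k:
            d = json.loads("{" + "".join(v))
            v = d["metatag.eprints.citation"]
            m = r.search(v)
            # Leave the citation unchanged if no "(YYYY)" is found.
            d["metatag.eprints.citation"] = v[:m.end()] if m else v
            json.dump(d, out)
            out.write("\n")
move(out.name, "in.txt")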
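To see why the regex is the safer choice, here is a small sketch that runs both cuts on a few citations. truncate_at_first_paren and truncate_after_year are hypothetical helper names, and the second and third sample strings are made up for illustration; only the first comes from the question.

```python
import re

# Four digits between parentheses, e.g. "(2006)".
YEAR_IN_PARENS = re.compile(r"\(\d{4}\)")


def truncate_at_first_paren(citation):
    """Cut after the first ')'; returns '' when there is no ')' at all."""
    return citation[:citation.find(")") + 1]


def truncate_after_year(citation):
    """Cut after the first '(YYYY)'; returns the input unchanged if none is found."""
    m = YEAR_IN_PARENS.search(citation)
    return citation[:m.end()] if m else citation


samples = [
    "Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. "
    "Journal of Corporate Real Estate",
    "Smith, J (ed.) (2012) A made-up title",  # hypothetical: parentheses before the year
    "No year here",                           # hypothetical: no parentheses at all
]
for s in samples:
    print(repr(truncate_at_first_paren(s)), "|", repr(truncate_after_year(s)))
```

The two functions agree on the first sample. On the second, the first-parenthesis cut stops at "(ed.)", and on the third it returns an empty string, while the regex version keeps the text intact.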
If you end up with an empty file, then your data must actually be one dict per line, so just iterate over the file object and apply the same logic:
with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        d = json.loads(line)
        v = d["metatag.eprints.citation"]
        d["metatag.eprints.citation"] = v[:v.find(")")+1]
        json.dump(d, out)
        out.write("\n")
move(out.name, "in.txt")
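If it is unclear which of the two layouts a file uses, one alternative (a different technique from the groupby approach above, offered only as a sketch) is json.JSONDecoder.raw_decode, which parses one object from a string and reports where it ended. That handles objects spread over several lines and objects on a single line alike. iter_json_objects is a hypothetical helper name, and the sketch assumes the whole file fits in memory.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, whatever the line layout."""
    decoder = json.JSONDecoder()
    pos, n = 0, len(text)
    while pos < n:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# One object split over several lines, followed by one on a single line.
text = (
    '{\n"url":"http://example.com/10618/",\n'
    '"metatag.eprints.title":"Corporate Real Estate Strategy"}\n'
    '{"url":"http://example.com/23552/"}\n'
)
objs = list(iter_json_objects(text))
print(len(objs), objs[0]["url"], objs[1]["url"])
```

Each yielded dict can then be updated and written out with json.dump exactly as in the loops above.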
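jq '.["metatag.eprints.citation"] |= ((match(".*?\\)").string) // .)'
Requires jq 1.5. This sets the value of metatag.eprints.citation to the result of matching it against the regular expression .*?\), which matches everything up to and including the first closing parenthesis. If for some reason there is no closing parenthesis, the alternative operator // sets the value back to the original. The parentheses keep that fallback inside the update.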