jq or python script to delete text after date in json field

I have a json file with hundreds of entries, for example:

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}

Can anyone help me write a jq or python script that, for each block, changes "metatag.eprints.citation" so that all text after the date is removed?

So the blocks above would become:

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006)"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)"}

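As long as your file is formatted the way it is in your question, you can use itertools.groupby to group the lines on the opening brace, join the lines with str.join and call json.loads to get a dict. After that it is just a matter of accessing the value by key and writing the updated data to a temporary file. Finally, use shutil.move to replace the original file. If you want a completely new file instead, just change NamedTemporaryFile to open: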

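from tempfile import NamedTemporaryFile
from shutil import move
from itertools import groupby

import json

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    # Lines that open a block ("{") form one group, the lines after them the next.
    for k, v in groupby(f, key=lambda x: x.lstrip().startswith("{")):
        if not k:
            # Put back the "{" that groupby split off, then parse the block.
            d = json.loads("{" + "".join(v))
            v = d["metatag.eprints.citation"]
            end = v.find(")")
            if end != -1:  # leave the citation alone if it has no ")"
                d["metatag.eprints.citation"] = v[:end + 1]
            json.dump(d, out)
            out.write("\n")
# Replace the original file with the rewritten one.
move(out.name, "in.txt")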

in.txt before:

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}

in.txt after:

{"url": "http://example.com/10618/", "metatag.eprints.publication": "Journal of Corporate Real Estate", "metatag.eprints.citation": "Adair, P, McGrogan, WS, and Webb, JR (2006)", "metatag.eprints.title": "Corporate Real Estate Strategy"}
{"url": "http://example.com/23552/", "metatag.eprints.publication": "European Journal of Cardio-Thoracic Surgery", "metatag.eprints.citation": "Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)", "metatag.eprints.title": "Long-term survival from coronary endarterectomies in coronary artery disease"}

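If you have to edit the file again later, you can simply iterate over it, call json.loads on each line to get a dict, update the value by key again and write it back to the file. Having one object per line will make your life easier.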
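A minimal sketch of that later edit, assuming the file already has one JSON object per line. The helper name update_json_lines and its parameters are made up for illustration; they are not part of any library:

```python
import json
import os
from shutil import move
from tempfile import NamedTemporaryFile


def update_json_lines(path, key, transform):
    """Rewrite a one-object-per-line JSON file, applying transform to d[key].

    Hypothetical helper: the name and signature are assumptions.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as f, NamedTemporaryFile("w", dir=directory, delete=False) as out:
        for line in f:
            if not line.strip():  # skip blank lines
                continue
            d = json.loads(line)
            if key in d:  # objects without the key are copied unchanged
                d[key] = transform(d[key])
            json.dump(d, out)
            out.write("\n")
    # Swap the files only after every line was written without an error.
    move(out.name, path)
```

For example, update_json_lines("in.txt", "metatag.eprints.title", str.strip) would strip surrounding whitespace from every title.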

If a parenthesis can appear in the citation before the date, you can use a regex to search for the specific substring, 4 digits between parentheses:

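import re

# Matches a four-digit year in parentheses, e.g. "(2006)".
r = re.compile(r"\(\d{4}\)")

# Replaces the loop inside the with block above.
for k, v in groupby(f, key=lambda x: x.lstrip().startswith("{")):
    if not k:
        d = json.loads("{" + "".join(v))
        v = d["metatag.eprints.citation"]
        m = r.search(v)
        if m:  # leave the citation untouched when no "(YYYY)" is found
            d["metatag.eprints.citation"] = v[:m.end()]
        json.dump(d, out)
        out.write("\n")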
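
To make the difference between the two approaches concrete, here is a small sketch that runs both on a made-up citation with a parenthesis before the year. The function names and the sample string are assumptions for illustration only:

```python
import re

YEAR_IN_PARENS = re.compile(r"\(\d{4}\)")


def cut_at_first_paren(citation):
    """Keep everything up to and including the first ')'."""
    end = citation.find(")")
    return citation[:end + 1] if end != -1 else citation


def cut_at_year(citation):
    """Keep everything up to and including the first '(YYYY)'."""
    m = YEAR_IN_PARENS.search(citation)
    return citation[:m.end()] if m else citation


# Made-up citation: "(ed.)" comes before the year.
sample = "Smith, J (ed.) and Doe, A (2010) Some Title. Some Journal"
print(cut_at_first_paren(sample))  # Smith, J (ed.)
print(cut_at_year(sample))         # Smith, J (ed.) and Doe, A (2010)
```

The find version stops at the first closing parenthesis it sees, while the regex version only stops at a four-digit year.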

If you end up with an empty file, then your data must actually be one dict per line, so just iterate over the file object and apply the same logic:

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        d = json.loads(line)
        v = d["metatag.eprints.citation"]
        d["metatag.eprints.citation"] = v[:v.find(")")+1]
        json.dump(d, out)
        out.write("\n")
move(out.name, "in.txt")
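
To see why the groupby version writes nothing for one-object-per-line input, here is a rough sketch that only counts the groups. The helper group_sizes and the shortened sample lines are made up for illustration:

```python
from itertools import groupby


def group_sizes(lines):
    """Return (key, number_of_lines) for each run produced by groupby."""
    def starts_block(x):
        return x.lstrip().startswith("{")
    return [(k, len(list(v))) for k, v in groupby(lines, key=starts_block)]


# Layout from the question: "{" alone on a line, then the fields.
multi_line = ['{\n', '    "url":"a",\n', '    "x":"b"}\n',
              '{\n', '    "url":"c",\n', '    "x":"d"}\n']
# One object per line: every line starts with "{".
one_per_line = ['{"url":"a","x":"b"}\n', '{"url":"c","x":"d"}\n']

print(group_sizes(multi_line))    # [(True, 1), (False, 2), (True, 1), (False, 2)]
print(group_sizes(one_per_line))  # [(True, 2)]
```

In the second case there is no group with a False key, so the `if not k` branch never runs and nothing is written.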

jq '.["metatag.eprints.citation"] |= (match(".*?\\)").string // .)'

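Requires jq 1.5. What this does is set the value of metatag.eprints.citation to the result of matching it against the regex .*?\) which matches everything up to and including the first closing parenthesis. If for some reason there is no closing parenthesis, we use the alternative operator // to set the value back to what it was.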
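
For comparison, the same idea can be sketched in Python: a non-greedy match up to the first closing parenthesis, falling back to the original value when there is no match. This is only a rough counterpart of the jq filter, and the function name truncate_citation is an assumption:

```python
import json
import re

# Non-greedy: as few characters as possible, then a literal ")".
FIRST_PAREN = re.compile(r".*?\)")


def truncate_citation(obj, key="metatag.eprints.citation"):
    """Return a copy of obj with obj[key] cut after the first ')'."""
    m = FIRST_PAREN.match(obj[key])
    result = dict(obj)
    # Fall back to the original value when there is no ")".
    result[key] = m.group(0) if m else obj[key]
    return result


entry = json.loads(
    '{"metatag.eprints.citation": "Adair, P, McGrogan, WS, and Webb, JR (2006) '
    'Corporate Real Estate Strategy. Journal of Corporate Real Estate"}'
)
print(truncate_citation(entry)["metatag.eprints.citation"])
# Adair, P, McGrogan, WS, and Webb, JR (2006)
```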