Simple command line JSON tool equivalent of nbstripout for Zeppelin notebooks

Question

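Some background

Version-controlling notebooks can become very inefficient if the outputs are expected to change a lot. I solved this for my Jupyter notebooks using nbstripout, but so far I have not found an alternative for Zeppelin notebooks.

Because nbstripout uses nbformat to parse ipynb files, patching it to support Zeppelin is not easy. On the other hand, the goal is not complicated: just empty all "msg": "..." fields.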

Objective

Given a JSON file, clear out all 'paragraphs.result.msg' fields.

Example (schema):

{"paragraphs": [{"result": {"msg": "Very long output..."}}]}

Answer 1

Git filter

The best solution (thanks to @steven-penny) is to run this:

git config filter.znbstripout.clean "jq '.paragraphs[].result.msg = \"\"'"

This sets up a filter named znbstripout that invokes the jq tool. Then, in your .gitattributes file, you can put:

*.json filter=znbstripout
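
If jq is not installed, the same clean filter could be sketched in Python. A Git clean filter receives the file content on stdin and must write the filtered content to stdout. This is only a minimal sketch, assuming the paragraphs[].result.msg layout from the question; the names strip_outputs and main, and the script itself, are hypothetical and not part of the original answer. Unlike the jq filter, it leaves paragraphs without a result untouched instead of creating one.

```python
#!/usr/bin/env python3
"""Hypothetical clean filter: blank out paragraphs[].result.msg in a Zeppelin note."""
import json
import sys


def strip_outputs(note):
    """Set every existing paragraphs[].result.msg to "" in place and return the note."""
    for paragraph in note.get("paragraphs", []):
        result = paragraph.get("result")
        # Only touch paragraphs that actually carry a result object with a msg.
        if isinstance(result, dict) and "msg" in result:
            result["msg"] = ""
    return note


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Git pipes the working-tree file to stdin and stores whatever lands on stdout.
    text = stdin.read()
    if not text.strip():
        # Empty input (for example an empty file): pass it through unchanged.
        stdout.write(text)
        return
    note = json.loads(text)
    json.dump(strip_outputs(note), stdout, indent=2, sort_keys=True)
    stdout.write("\n")


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing, it could be registered in place of the jq command, for example with git config filter.znbstripout.clean "python3 /path/to/script.py" (the path is a placeholder).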

Python script (can be used with a Git hook)

The following can be used as a git hook:

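#!/usr/bin/env python3

from glob import glob
import json

# Find every Zeppelin note under the current directory.
files = glob('**/note.json', recursive=True)
for file in files:
    with open(file, 'r') as fp:
        nb = json.load(fp)
    # Blank the output of every paragraph that has a result.
    for p in nb['paragraphs']:
        if 'result' in p:
            p['result']['msg'] = ""
    # Write the stripped notebook back in place, with a stable key order.
    with open(file, 'w') as fp:
        json.dump(nb, fp, sort_keys=True, indent=2)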

Answer 2

jq can do this:

jq '.paragraphs[].result.msg' file

http://stedolan.github.io/jq

Answer 3

In (1) and (2) below, I assume the incoming JSON looks like this:

{
  "paragraphs": [
    {
      "result": {
        "msg": "msg1"
      }
    },
    {
      "result": {
        "msg": "msg2"
      }
    }
  ]
}

1. To set the .result.msg values to "":

.paragraphs[].result.msg = ""

2. To delete the .result.msg fields entirely:

del(.paragraphs[].result.msg)

3. To delete the "msg" field from all objects, wherever they occur:

walk(if type == "object" then del(.msg) else . end)

(If your jq does not have walk, google: jq faq walk)

4. To delete the "msg" fields that appear in .result objects within .paragraphs arrays:

 walk(if type == "object" and (.paragraphs|type) == "array"
      then del(.paragraphs[].result?.msg?) else . end)
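
For readers more at home in Python, the four filters above can be mirrored roughly as follows. This is a sketch rather than an exact translation: the function names are made up for illustration, the functions mutate the parsed document in place, and (1) and (2) assume every paragraph has a result object, as in the sample JSON.

```python
import json

SAMPLE = '{"paragraphs": [{"result": {"msg": "msg1"}}, {"result": {"msg": "msg2"}}]}'


def blank_msgs(doc):
    """(1) Roughly .paragraphs[].result.msg = "" for the sample layout."""
    for p in doc["paragraphs"]:
        p["result"]["msg"] = ""
    return doc


def delete_msgs(doc):
    """(2) Roughly del(.paragraphs[].result.msg)."""
    for p in doc["paragraphs"]:
        p["result"].pop("msg", None)
    return doc


def delete_msg_everywhere(value):
    """(3) Roughly walk(if type == "object" then del(.msg) else . end)."""
    if isinstance(value, dict):
        value.pop("msg", None)
        for child in value.values():
            delete_msg_everywhere(child)
    elif isinstance(value, list):
        for child in value:
            delete_msg_everywhere(child)
    return value


def delete_paragraph_msgs_everywhere(value):
    """(4) Like (3), but only removes result.msg inside a "paragraphs" array."""
    if isinstance(value, dict):
        paragraphs = value.get("paragraphs")
        if isinstance(paragraphs, list):
            for p in paragraphs:
                result = p.get("result") if isinstance(p, dict) else None
                if isinstance(result, dict):
                    result.pop("msg", None)
        for child in value.values():
            delete_paragraph_msgs_everywhere(child)
    elif isinstance(value, list):
        for child in value:
            delete_paragraph_msgs_everywhere(child)
    return value


print(json.dumps(blank_msgs(json.loads(SAMPLE))))
# {"paragraphs": [{"result": {"msg": ""}}, {"result": {"msg": ""}}]}
print(json.dumps(delete_msgs(json.loads(SAMPLE))))
# {"paragraphs": [{"result": {}}, {"result": {}}]}
```

Variants (1) and (2) fail loudly with a KeyError if the layout differs, while (3) and (4) silently skip anything that does not match, which may be preferable for a filter that runs on every commit.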
