How to extract text from json with parent siblings and substrings?

Question

https://bern.korea.ac.kr/pubmed/32818866

$ jq -r '.[] | .denotations | .[] | select(.obj=="drug") | .span | [.begin, .end] | @tsv'

With the jq command above, I can extract the following information from the JSON returned by the URL above.

377 387
562 579
584 602
659 676
681 699
919 936
941 959
1032    1049
1054    1072

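But the output I actually need is shown further below.

The last column is simply the substring of text from begin+1 to end (assuming the characters in text are indexed starting from 1).

I don't know how to extract this using jq alone, because it involves fetching a sibling of a parent element (sourceid) and a substring of another sibling of the parent (text). Could anyone show me how to produce output in this format? Thanks.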

32818866    377 387 silica gel
32818866    562 579 7-methoxycoumarin
32818866    584 602 8-prenylkaempferol
32818866    659 676 7-methoxycoumarin
32818866    681 699 8-prenylkaempferol
32818866    919 936 7-methoxycoumarin
32818866    941 959 8-prenylkaempferol
32818866    1032    1049    7-methoxycoumarin
32818866    1054    1072    8-prenylkaempferol

The JSON text is included here so that this post is self-contained.

[
  {
    "project": "BERN",
    "sourcedb": "PubMed",
    "sourceid": "32818866",
    "text": "Identification of two bitter components in Zanthoxylum bungeanum Maxim. and exploration of their bitter taste mechanism through receptor hTAS2R14. Bitterness is an inherent organoleptic characteristic affecting the flavor of Zanthoxylum bungeanum Maxim. In this study, the vital bitter components of Z. bungeanum were concentrated through solvent extraction, sensory analysis, silica gel chromatography, and thin-layer chromatographic techniques and subsequently identified by UPLC-Q-TOF-MS. Two components with the highest bitterness intensities (BIs), such as 7-methoxycoumarin and 8-prenylkaempferol were selected. The bitter taste perceived thresholds of 7-methoxycoumarin and 8-prenylkaempferol were 0.062 mmol/L and 0.022 mmol/L, respectively. Moreover, the correlation between the contents of the two bitter components and the BIs of Z. bungeanum were proved. The results of siRNA and flow cytometry showed that 7-methoxycoumarin and 8-prenylkaempferol could activate the bitter receptor hTAS2R14. The results concluded that 7-methoxycoumarin and 8-prenylkaempferol contribute to the bitter taste of Z. bungeanum.",
    "denotations": [
      {
        "id": [
          "NCBI:txid328401"
        ],
        "span": {
          "begin": 43,
          "end": 64
        },
        "obj": "species"
      },
      {
        "id": [
          "CUI-less"
        ],
        "span": {
          "begin": 128,
          "end": 145
        },
        "obj": "gene"
      },
      {
        "id": [
          "NCBI:txid328401"
        ],
        "span": {
          "begin": 225,
          "end": 246
        },
        "obj": "species"
      },
      {
        "id": [
          "NCBI:txid328401"
        ],
        "span": {
          "begin": 300,
          "end": 312
        },
        "obj": "species"
      },
      {
        "id": [
          "MESH:D058428",
          "BERN:315272203"
        ],
        "span": {
          "begin": 377,
          "end": 387
        },
        "obj": "drug"
      },
      {
        "id": [
          "CHEBI:5679",
          "BERN:4597103"
        ],
        "span": {
          "begin": 562,
          "end": 579
        },
        "obj": "drug"
      },
      {
        "id": [
          "MESH:C532177",
          "BERN:280529003"
        ],
        "span": {
          "begin": 584,
          "end": 602
        },
        "obj": "drug"
      },
      {
        "id": [
          "CHEBI:5679",
          "BERN:4597103"
        ],
        "span": {
          "begin": 659,
          "end": 676
        },
        "obj": "drug"
      },
      {
        "id": [
          "MESH:C532177",
          "BERN:280529003"
        ],
        "span": {
          "begin": 681,
          "end": 699
        },
        "obj": "drug"
      },
      {
        "id": [
          "NCBI:txid328401"
        ],
        "span": {
          "begin": 841,
          "end": 853
        },
        "obj": "species"
      },
      {
        "id": [
          "CHEBI:5679",
          "BERN:4597103"
        ],
        "span": {
          "begin": 919,
          "end": 936
        },
        "obj": "drug"
      },
      {
        "id": [
          "MESH:C532177",
          "BERN:280529003"
        ],
        "span": {
          "begin": 941,
          "end": 959
        },
        "obj": "drug"
      },
      {
        "id": [
          "CUI-less"
        ],
        "span": {
          "begin": 979,
          "end": 994
        },
        "obj": "gene"
      },
      {
        "id": [
          "CUI-less"
        ],
        "span": {
          "begin": 995,
          "end": 1003
        },
        "obj": "gene"
      },
      {
        "id": [
          "CHEBI:5679",
          "BERN:4597103"
        ],
        "span": {
          "begin": 1032,
          "end": 1049
        },
        "obj": "drug"
      },
      {
        "id": [
          "MESH:C532177",
          "BERN:280529003"
        ],
        "span": {
          "begin": 1054,
          "end": 1072
        },
        "obj": "drug"
      },
      {
        "id": [
          "NCBI:txid328401"
        ],
        "span": {
          "begin": 1107,
          "end": 1119
        },
        "obj": "species"
      }
    ],
    "timestamp": "Wed Oct 28 21:43:04 +0000 2020",
    "logits": {
      "disease": [],
      "gene": [
        [
          {
            "start": 128,
            "end": 145,
            "id": "CUI-less"
          },
          0.7066106796264648
        ],
        [
          {
            "start": 979,
            "end": 994,
            "id": "CUI-less"
          },
          0.9999749660491943
        ],
        [
          {
            "start": 995,
            "end": 1003,
            "id": "CUI-less"
          },
          0.9052715301513672
        ]
      ],
      "drug": [
        [
          {
            "start": 377,
            "end": 387,
            "id": "MESH:D058428\tBERN:315272203"
          },
          0.999982476234436
        ],
        [
          {
            "start": 562,
            "end": 579,
            "id": "CHEBI:5679\tBERN:4597103"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 584,
            "end": 602,
            "id": "MESH:C532177\tBERN:280529003"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 659,
            "end": 676,
            "id": "CHEBI:5679\tBERN:4597103"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 681,
            "end": 699,
            "id": "MESH:C532177\tBERN:280529003"
          },
          0.9999980330467224
        ],
        [
          {
            "start": 919,
            "end": 936,
            "id": "CHEBI:5679\tBERN:4597103"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 941,
            "end": 959,
            "id": "MESH:C532177\tBERN:280529003"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 1032,
            "end": 1049,
            "id": "CHEBI:5679\tBERN:4597103"
          },
          0.9999980926513672
        ],
        [
          {
            "start": 1054,
            "end": 1072,
            "id": "MESH:C532177\tBERN:280529003"
          },
          0.9999980926513672
        ]
      ],
      "species": [
        [
          {
            "start": 43,
            "end": 64,
            "id": "NCBI:txid328401"
          },
          0.9999997615814209
        ],
        [
          {
            "start": 225,
            "end": 246,
            "id": "NCBI:txid328401"
          },
          0.9999998211860657
        ],
        [
          {
            "start": 300,
            "end": 312,
            "id": "NCBI:txid328401"
          },
          0.9999998211860657
        ],
        [
          {
            "start": 841,
            "end": 853,
            "id": "NCBI:txid328401"
          },
          0.9999998211860657
        ],
        [
          {
            "start": 1107,
            "end": 1119,
            "id": "NCBI:txid328401"
          },
          0.9999998211860657
        ]
      ]
    },
    "elapsed_time": {
      "tmtool": 0.991,
      "ner": 0.453,
      "normalization": 0.172,
      "total": 1.617
    }
  }
]

Answer 1

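Assuming the first column of the desired output is "sourceid", your solution can be adapted as follows. The top-level fields are saved in variables before descending into denotations, so they are still available once `.` refers to a span: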

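.[]
| .sourceid as $id          # keep the top-level id in a variable
| .text as $text            # keep the full text as well
| .denotations[]
| select(.obj=="drug")
| .span
| [$id, .begin, .end, $text[.begin : .end]]   # slice is 0-based, end-exclusive
| @tsv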
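
To check the offsets outside jq, the same logic can be sketched in Python. This is only an illustrative cross-check, not part of the jq solution: the function name `drug_rows` and the tiny sample record are made up for the example, and the record just mimics the shape of the BERN output. Python slices, like the jq slice above, are 0-based and end-exclusive, which is why `text[begin:end]` matches "begin+1 to end" in 1-based terms.

```python
import json


def drug_rows(doc_json):
    """Yield (sourceid, begin, end, mention) for each drug denotation."""
    for record in json.loads(doc_json):
        # Capture the top-level fields before walking the denotations,
        # mirroring `.sourceid as $id | .text as $text` in the jq filter.
        source_id = record["sourceid"]
        text = record["text"]
        for denotation in record["denotations"]:
            if denotation["obj"] != "drug":
                continue
            begin = denotation["span"]["begin"]
            end = denotation["span"]["end"]
            # 0-based, end-exclusive slice of the sibling "text" field.
            yield source_id, begin, end, text[begin:end]


# Hypothetical sample in the same shape as the BERN response (not real data).
sample = json.dumps([{
    "sourceid": "0",
    "text": "We tested aspirin in mice.",
    "denotations": [
        {"id": ["x"], "span": {"begin": 10, "end": 17}, "obj": "drug"},
        {"id": ["y"], "span": {"begin": 21, "end": 25}, "obj": "species"},
    ],
}])

for row in drug_rows(sample):
    # Tab-separated, like jq's @tsv.
    print("\t".join(str(field) for field in row))
```

Only the drug denotation is printed; the species entry is filtered out in the same way `select(.obj=="drug")` drops it in the jq version.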
