How to search and copy an item given the ID in a large JSON file

Question

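I have two large files:

one is a text file with many IDs, one ID per line;
the other is a 6+ GB JSON file containing many items.

I need to search for these IDs in a specific field of the JSON file and copy the whole item each ID refers to into a new file for later analysis.

Here is an example: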

IDs.txt

    unique_id_1
    unique_id_2
    ...

schema.json

    [
        {
            "id": "unique_id_1",
            "name": "",
            "text": "",
            "date": ""
        },
        {
            "id": "unique_id_aaa",
            "name": "",
            "text": "",
            "date": ""
        },
        {
            "id": "unique_id_2",
            "name": "",
            "text": "",
            "date": ""
        },
        ...
    ]

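I am using Python with Pandas for this analysis, but I am running into trouble because the files are so large. What is the best way to do this? I am also open to using other software or languages.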

Answer 1

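I implemented my second suggestion. It only works if the schema is flat, that is, if there are no nested objects in the JSON file. I also did not check what happens when a value in the JSON file is a dictionary; that case would need more careful handling, because I currently look for a } at the start of a line to decide whether an object has ended.

You still need to load the whole IDs file, because you need some way to check whether a given object is wanted.

If the useful_objects list grows too large, you can easily save it to disk periodically while parsing the file.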
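The periodic saving mentioned above could look roughly like the sketch below. It is not part of the original answer: `BATCH_SIZE`, `flush_batch` and `collect` are hypothetical names, `collect` merely stands in for the parsing loop, and the output is assumed to be JSON Lines (one object per line) so that batches can be appended without rewriting the file.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO

BATCH_SIZE = 10_000  # hypothetical threshold; tune to the available memory


def flush_batch(batch: List[Dict[str, str]], out_f: TextIO) -> None:
    """Write each object as one JSON line, then empty the list in place."""
    for obj in batch:
        out_f.write(json.dumps(obj) + "\n")
    batch.clear()


def collect(
    objects: Iterable[Dict[str, str]],
    out_f: TextIO,
    batch_size: int = BATCH_SIZE,
) -> None:
    """Stand-in for the parsing loop: buffer objects, flush periodically."""
    useful_objects: List[Dict[str, str]] = []
    for obj in objects:
        useful_objects.append(obj)
        if len(useful_objects) >= batch_size:
            flush_batch(useful_objects, out_f)
    # flush whatever is left once parsing is done
    flush_batch(useful_objects, out_f)


# small in-memory demo; a real run would pass a file opened for writing
demo_out = io.StringIO()
collect([{"id": "a"}, {"id": "b"}, {"id": "c"}], demo_out, batch_size=2)
print(demo_out.getvalue(), end="")
```

In the real script the `len(useful_objects) >= batch_size` check would sit right after `useful_objects.append(temp)`, so memory use stays bounded by the batch size rather than by the number of matches.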

    import json
    import re
    from pathlib import Path
    from typing import Dict

    schema_name = "schema.json"
    schema_path = Path(schema_name)
    ids_name = "IDs.txt"
    ids_path = Path(ids_name)

    # read the ids into a set for fast membership checks
    useful_ids = set()
    with ids_path.open(encoding="utf-8") as id_f:
        for line in id_f:
            id_ = line.strip()
            if id_:  # ignore blank lines
                useful_ids.add(id_)
    print(f"Loaded {len(useful_ids)} ids")

    useful_objects = []
    temp: Dict[str, str] = {}
    was_useful = False

    # only matches `"key": "string value"` pairs; other value types are skipped
    pair_re = re.compile(r'"(.*?)": "(.*)"')

    with schema_path.open(encoding="utf-8") as sc_f:
        # per-line debug prints are left out: they would flood the output
        # on a multi-gigabyte file
        for line in sc_f:
            # remove start/end whitespace
            line = line.strip()

            # an object is ending (startswith is safe on empty lines,
            # whereas line[0] raises IndexError)
            if line.startswith("}"):
                # add it
                if was_useful:
                    useful_objects.append(temp)
                # reset the usefulness for the next object
                was_useful = False
                # reset the temp object
                temp = {}

            # parse the line
            match = pair_re.match(line)

            # if this did not match, skip the line
            if match is None:
                continue

            # extract the data from the regex match
            key = match.group(1)
            value = match.group(2)

            # build the temp object incrementally
            temp[key] = value

            # check if this object is useful
            if key == "id" and value in useful_ids:
                was_useful = True

    useful_json = json.dumps(useful_objects, indent=4)
    print(useful_json)

Again, this is neither very elegant nor very robust, but as long as you are aware of its limitations, it gets the job done.

Cheers!
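If the flat-schema limitation becomes a problem, one possible alternative (a different technique from the line-based parser above, not something the original answer proposes) is to decode the top-level array one element at a time with the standard library's `json.JSONDecoder.raw_decode`. The sketch below assumes the file is strictly valid JSON, that every array element is an object, and that IDs are strings; `iter_array_items` and `filter_by_id` are hypothetical helper names.

```python
import io
import json
from typing import Any, Dict, Iterator, Set, TextIO


def iter_array_items(f: TextIO, chunk_size: int = 1 << 20) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    pos = 0
    started = False  # True once the opening "[" has been consumed
    while True:
        # skip whitespace and the commas that separate elements
        while pos < len(buf) and buf[pos] in " \t\r\n,":
            pos += 1
        if pos < len(buf):
            if not started:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                pos += 1
                continue
            if buf[pos] == "]":
                return
            try:
                item, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                pass  # element is probably incomplete: read more below
            else:
                yield item
                continue
        chunk = f.read(chunk_size)
        if not chunk:
            if pos < len(buf):
                raise ValueError("truncated or invalid JSON")
            return
        # keep only the unparsed tail so memory use stays bounded
        buf = buf[pos:] + chunk
        pos = 0


def filter_by_id(
    f: TextIO, useful_ids: Set[str], key: str = "id", chunk_size: int = 1 << 20
) -> Iterator[Dict[str, Any]]:
    """Yield only the objects whose `key` field is one of useful_ids."""
    for item in iter_array_items(f, chunk_size):
        if isinstance(item, dict):
            value = item.get(key)
            if isinstance(value, str) and value in useful_ids:
                yield item


# in-memory demo with a nested object and a "}" inside a string value
sample = io.StringIO(
    '[{"id": "unique_id_1", "meta": {"tags": ["a", "b"]}},'
    ' {"id": "unique_id_aaa"},'
    ' {"id": "unique_id_2", "text": "has } inside"}]'
)
matches = list(filter_by_id(sample, {"unique_id_1", "unique_id_2"}, chunk_size=16))
print(json.dumps(matches, indent=4))
```

Because each element goes through a real JSON decoder, nested objects and braces inside strings are handled correctly. The trade-off is that a malformed element makes the function keep reading until the end of the file before it reports an error.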


Tags: python, json, bigdata, pandas