How to remove text between two delimiters in Python

Question

I'm trying to remove all the text between the [] brackets that follow the key "segmentation". See the file snippet below for context.

 "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "segmentation": [
                [
                    621.63,
                    1085.67,
                    621.63,
                    1344.71,
                    841.66,
                    1344.71,
                    841.66,
                    1085.67
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                621.63,
                1085.67,
                220.02999999999997,
                259.03999999999996
            ],
            "area": 56996,
            "category_id": 1124044
        },
        {
            "id": 2,
            "image_id": 1,
            "segmentation": [
                [
                    887.62,
                    1355.7,
                    887.62,
                    1615.54,
                    1114.64,
                    1615.54,
                    1114.64,
                    1355.7
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                887.62,
                1355.7,
                227.0200000000001,
                259.8399999999999
            ],
            "area": 58988,
            "category_id": 1124044
        },
        {
            "id": 3,
            "image_id": 1,
            "segmentation": [
                [
                    1157.61,
                    1411.84,
                    1157.61,
                    1661.63,
                    1404.89,
                    1661.63,
                    1404.89,
                    1411.84
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                1157.61,
                1411.84,
                247.2800000000002,
                249.7900000000002
            ],
            "area": 61768,
            "category_id": 1124044
        },
        ........... and so on.....

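Ultimately I just want to remove everything inside the square brackets that follow each occurrence of "segmentation". In other words, the output should look like this (for the first instance):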

"annotations": [
            {
                "id": 1,
                "image_id": 1,
                "segmentation": [],
                "iscrowd": 0,
                "bbox": [
                    621.63,
                    1085.67,
                    220.02999999999997,
                    259.03999999999996
                ],
                "area": 56996,
                "category_id": 1124044
            },

I've tried the code below, but haven't had much luck so far. Am I going wrong somewhere because of the newlines?

import re
f = open('samplfile.json')
text = f.read()
f.close()

clean = re.sub('"segmentation":(.*)\]', '', text)

print(clean)

f = open('cleanedfile.json', 'w')
f.write(clean)
f.close()

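I appreciate that my exact positioning of the [s in the clean line may not be quite right, but at the moment this code doesn't remove anything at all.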

Answer 1

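Python has a built-in json module for parsing and modifying JSON. Regular expressions are brittle for this kind of task and tend to be more trouble than they're worth.

You can do something like this: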

import json

with open('samplfile.json') as input_file, open('output.json', 'w') as output_file:
    data = json.load(input_file)
    # Iterate over the annotation dicts directly and empty each polygon list
    for annotation in data['annotations']:
        annotation['segmentation'] = []

    json.dump(data, output_file, indent=4)

output.json then contains:

{
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "segmentation": [],
            "iscrowd": 0,
            "bbox": [
                621.63,
                1085.67,
                220.02999999999997,
                259.03999999999996
            ],
            "area": 56996,
            "category_id": 1124044
        },
        {
            "id": 2,
            "image_id": 1,
            "segmentation": [],
            "iscrowd": 0,
            "bbox": [
                887.62,
                1355.7,
                227.0200000000001,
                259.8399999999999
            ],
            "area": 58988,
            "category_id": 1124044
        },
        {
            "id": 3,
            "image_id": 1,
            "segmentation": [],
            "iscrowd": 0,
            "bbox": [
                1157.61,
                1411.84,
                247.2800000000002,
                249.7900000000002
            ],
            "area": 61768,
            "category_id": 1124044
        }
    ]
}
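
The same idea can be tried on a string in memory before touching any files. This is a minimal sketch, assuming the whole document is a JSON object with an "annotations" list as in the question; the helper name clear_segmentations and the shortened sample data are made up for illustration and are not part of the json module.

```python
import json


def clear_segmentations(json_text):
    """Return json_text with every annotation's "segmentation" emptied."""
    data = json.loads(json_text)
    for annotation in data.get("annotations", []):
        # Replace the polygon list with an empty list; other keys are untouched.
        annotation["segmentation"] = []
    return json.dumps(data, indent=4)


# Shortened, hypothetical sample in the same shape as the question's file.
sample = """
{
    "annotations": [
        {"id": 1, "segmentation": [[621.63, 1085.67, 621.63, 1344.71]], "iscrowd": 0},
        {"id": 2, "segmentation": [[887.62, 1355.7, 887.62, 1615.54]], "iscrowd": 0}
    ]
}
"""

cleaned = clear_segmentations(sample)
print(cleaned)
```

Because the text is parsed rather than pattern-matched, this does not depend on how the brackets are nested or where the line breaks fall.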

Answer 2

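Your approach is mostly correct, but in Python regex . does not match \n by default. To fix it, add flags=re.DOTALL as an argument to re.sub.

By the way, you may need to use \" instead of " in the regex if the pattern is written as a double-quoted string literal. Inside the single-quoted string used in the question, a plain " works as is.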
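
If you do want to stay with a regex, here is a rough sketch of how the pattern might be adjusted. It assumes every "segmentation" value is a non-empty list of flat lists, as in the sample, so the value ends at the first ] that is followed only by whitespace and another ]. Two things differ from the question's code: with re.DOTALL a greedy (.*) would run on to the last ] in the file, so the sketch uses a non-greedy .*?, and the replacement puts "segmentation": [] back instead of an empty string. The pattern, variable names, and shortened sample text are illustrative only.

```python
import re

# Shortened, hypothetical fragment in the same layout as the question's file.
text = '''"annotations": [
    {
        "id": 1,
        "segmentation": [
            [
                621.63,
                1085.67
            ]
        ],
        "iscrowd": 0,
        "bbox": [
            621.63,
            1085.67
        ]
    },
    {
        "id": 2,
        "segmentation": [
            [
                887.62,
                1355.7
            ]
        ],
        "iscrowd": 0
    }
]'''

# Match from the key through the closing "] ]" of its nested list.
# .*? is non-greedy so each match stops at the nearest closing pair.
pattern = r'"segmentation":\s*\[.*?\]\s*\]'

# re.DOTALL lets . match newlines, which the multi-line value requires.
clean = re.sub(pattern, '"segmentation": []', text, flags=re.DOTALL)

print(clean)
```

This pattern would misbehave on a value that is already [] or that is nested more deeply, which is one reason parsing the file as JSON is the safer route.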

python

text

python-re