How to correct improper JSON containing extraneous quote marks when returned by an OCR engine?

Our OCR engine returns results as JSON data: {"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}

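Note that the value of "WordText" contains a double quote after the backslash. When I parse it with json.loads, I get an "Expecting delimiter" error. The OCR engine produces many errors of this kind whenever it encounters a double quote in the text. There seems to be no way to change the OCR output, so I need to write post-processing code to correct these errors.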
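A minimal reproduction, assuming the line above is the exact text returned by the engine (the variable names are only illustrative). Read as raw text, the \" sequence is a legal JSON escape and the line parses. Once the same text passes through a regular Python string literal, the backslash is consumed and the parser sees ""*EET", which is one way to end up with the delimiter error.

```python
import json

# Raw literal: the backslash survives, so \" is a valid JSON escape.
as_received = r'{"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}'
word = json.loads(as_received)["WordText"]
print(word)

# Regular literal: Python turns \" into ", leaving ""*EET" in the text.
as_pasted = '{"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}'
message = None
try:
    json.loads(as_pasted)
except json.JSONDecodeError as exc:
    message = str(exc)
print(message)
```

If the text only fails in the second form, reading the OCR output from the file or response body directly (rather than pasting it into source code) may already avoid this particular case.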

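I would be happy to eliminate any double quote that does not immediately follow a colon or precede a comma, but I don't know how to do that efficiently in Python or with a regular expression.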

Does anyone have suggestions or tools for fixing this kind of JSON problem?
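The rule described in the question can be sketched with a single regex pass. This is only a sketch under stated assumptions: strip_stray_quotes is a hypothetical helper, and the rule is widened slightly so that quotes next to braces and brackets also count as structural. A quote is kept when it opens a string (it follows one of { [ : , ) or closes one (it precedes one of } ] : , ); any other quote is dropped, along with a backslash directly in front of it.

```python
import json
import re

# Hypothetical helper, not part of any library.
# group 1: opening quote   -> one of { [ : ,  then optional whitespace, then "
# group 2: closing quote   -> " followed (after optional whitespace) by } ] : ,
# otherwise: a stray quote, together with a backslash directly before it
QUOTE = re.compile(r'([{\[:,]\s*")|("(?=\s*[}\]:,]))|\\?"')

def strip_stray_quotes(text):
    # Keep structural quotes unchanged, replace stray ones with nothing.
    return QUOTE.sub(lambda m: m.group(1) or m.group(2) or '', text)

# Raw string so the backslash reaches the regex untouched.
raw = r'''[{"WordText":"\"*EET", "Left":88.0},
{"WordText":""4512","Left":1.0},
{"WordText":"IV"L","Left":98.0}]'''

records = json.loads(strip_stray_quotes(raw))
print([r["WordText"] for r in records])
```

A stray quote that happens to sit right next to a comma or colon inside the recognized text looks structural to this rule, so it is kept and the record will still fail to parse.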

Does this help with the extra escaping...

Dump to JSON adds additional double quotes and escaping of quotes

This may not be perfect (using two regex patterns feels a bit crude), but for the given JSON...

{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}

This code...

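import re
from io import StringIO

import pandas as pd

pattern1 = re.compile(r'(?i)(\"\"|\"\\")')  # replace with: "
pattern2 = re.compile(r'(?i)(\w)(\")(\w)')  # replace with: \1\3 (drops only the quote)

# Not a raw string, so Python turns \" into " and the first record arrives as ""*EET"
data = '''
[{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''

data = pattern1.sub(r'"', data)
# keep the characters on both sides of the quote; replacing with '' would delete them too
data = pattern2.sub(r'\1\3', data)

# load it into a pandas DataFrame just to prove it is valid
# (StringIO because pandas 2.1+ deprecates passing a literal JSON string)
df = pd.read_json(StringIO(data))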

print(df)

Output...

  WordText  Left  Top  Height  Width
0     *EET    88  153       7     21
1     4512     1   94       7     24
2      IVL    98  135       6     13

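Maybe take a look at the extra-escaping link at the start of this answer to see whether the problem originates there. This may also be useful...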

**Update:**

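Here is new code in which a corrupted sample JSON is repaired by the two regex patterns. I don't have your JSON, but it shows that regex should help with the kinds of corruption described so far. I have commented the code to help explain it.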

Code:

import re
from io import StringIO

import pandas as pd

# compile a pattern to match "\"text" OR ""text", which needs replacing with a single doublequote
pattern1 = re.compile(r'(?i)(\"\"|\"\\")')

# compile a second pattern to match "te"xt", where the inner doublequote needs to be removed
pattern2 = re.compile(r'(?i)\b(\")\b')

# if this was the input (good_data) it would work without any clean up
good_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Management", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# copied good_data and corrupted it with "\"Erik", ""Gary", and "Mana"gement"
# (not a raw string, so Python itself turns \" into " before the regex sees it)
bad_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["\"Erik", "Daniel", "Michael", "Sven",
                ""Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Mana"gement", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# run the bad_data through the find and replace for the two patterns
# first one finds such mistakes as "\"text" OR ""text" and replaces them with a single doublequote
bad_data = pattern1.sub(r'"', bad_data)

# second pattern finds a doublequote on its own in the middle of a word like "te"xt" and removes it
bad_data = pattern2.sub(r'', bad_data)

# read the fixed bad_data into a pandas DataFrame to check it's valid
# (StringIO because pandas 2.1+ deprecates passing a literal JSON string)
df = pd.read_json(StringIO(bad_data))

# print out the df
print(df)

Output:

   Sub_ID       Name  Salary   StartDate  Department Sex
0       1       Erik  723.30    1/1/2011          IT   M
1       2     Daniel  515.20   7/23/2013  Management   M
2       3    Michael  621.00  12/15/2011          IT   M
3       4       Sven  731.00   6/11/2013          HR   M
4       5       Gary  844.15   3/27/2011     Finance   M
5       6      Carol  558.00   5/21/2012          IT   F
6       7       Lisa  642.80   7/30/2013  Management   F
7       8  Elisabeth  732.50   6/17/2014          IT   F

If you comment out the regex replacement lines...

bad_data = pattern1.sub(r'"', bad_data)
bad_data = pattern2.sub(r'', bad_data)

...and let pandas read the JSON, it raises an error...

ValueError: Unexpected character found when decoding array value (2)

...which is expected.
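One caveat with running the two substitutions unconditionally is that the first pattern also matches a legitimately empty string (""), which would corrupt an otherwise valid record. A possible way around that, sketched here with a hypothetical load_ocr_json helper and the standard json module instead of pandas, is to try parsing first and apply the repair only when parsing fails.

```python
import json
import re

# The two patterns from the answer above, without the redundant groups and flags.
DOUBLED_QUOTE = re.compile(r'""|"\\"')   # ""text or "\"text
INNER_QUOTE = re.compile(r'\b"\b')       # te"xt

def load_ocr_json(text):
    """Parse text as JSON, applying the regex repair only if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = DOUBLED_QUOTE.sub('"', text)
        repaired = INNER_QUOTE.sub('', repaired)
        # Still raises JSONDecodeError if the repair was not enough.
        return json.loads(repaired)

good = '{"WordText":"","Left":1.0}'       # valid: the empty string must survive
bad = '{"WordText":"IV"L","Left":98.0}'   # stray quote inside the value

print(load_ocr_json(good))
print(load_ocr_json(bad))
```

Applying this per record rather than to the whole response would limit the repair to the records that are actually broken, assuming the OCR output can be split into records beforehand.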