How to correct improper JSON containing extraneous quote marks when returned by an OCR engine?

Question

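Our OCR engine returns its results as JSON data: {"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}

Note that the value of "WordText" contains a double quote after a backslash. When I parse it with json.loads, I get an "Expecting delimiter" error. The OCR engine produces a lot of these errors whenever it encounters a double quote in the text. There seems to be no way to change the OCR output, so I need to write post-processing code to correct these errors.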
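A minimal reproduction may help pin down where the error comes from. The sketch below assumes (this is a guess, not something stated above) that the backslash gets lost somewhere before parsing, for example because the text was pasted into a non-raw Python string literal. With the backslash intact the record is valid JSON; without it the parser sees an empty string followed by stray text.

```python
import json

# With the backslash intact, the inner quote is properly escaped and the record parses.
escaped = r'{"WordText":"\"*EET","Left":88.0}'
parsed = json.loads(escaped)

# Assumed failure mode: the backslash is gone, so "" is read as an empty
# string and the parser then expects a comma or a closing brace.
unescaped = '{"WordText":""*EET","Left":88.0}'
try:
    json.loads(unescaped)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg

print(parsed["WordText"])
print(error_message)
```

If the first form parses fine in your environment, the data may only be getting damaged on its way into the parser rather than by the OCR engine itself.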

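I would be happy to eliminate any double quote that does not immediately follow a colon or precede a comma, but I don't know how to do that efficiently in Python or with a regular expression.

Does anyone have suggestions or tools that can fix this kind of JSON problem?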
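The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name strip_stray_quotes is hypothetical, the set of "structural" neighbours is widened from colon and comma to also include braces and brackets, and the input is assumed to contain no backslash escapes and no text that itself looks like JSON punctuation next to a quote.

```python
import json


def strip_stray_quotes(text):
    """Drop double quotes that neither open nor close a JSON string (heuristic)."""
    kept = []
    for i, ch in enumerate(text):
        if ch == '"':
            # nearest non-space character on each side of this quote
            before = text[:i].rstrip()[-1:]
            after = text[i + 1:].lstrip()[:1]
            opens = before != "" and before in "{[,:"
            closes = after != "" and after in ":,}]"
            if not (opens or closes):
                continue  # stray quote inside a value: skip it
        kept.append(ch)
    return "".join(kept)


broken_records = [
    '{"WordText":""*EET", "Left":88.0}',
    '{"WordText":"IV"L","Left":98.0}',
]
repaired = [json.loads(strip_stray_quotes(r)) for r in broken_records]
print(repaired)
```

The slicing makes this quadratic in the length of the text, which is fine for short OCR records but would need reworking for large documents. OCR text such as `a",b` would still fool it, so the result is best validated with json.loads as shown.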

Answer 1

Does this help with the extra escaping?...

Dump to JSON adds additional double quotes and escaping of quotes

This may not be perfect (using two regex patterns feels a bit crude), but for the given JSON...

{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}

this code...

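import re
from io import StringIO

import pandas as pd

# matches ""text or "\"text; replaced with a single double quote
pattern1 = re.compile(r'(\"\"|\"\\")')
# matches a stray double quote between two word characters, as in te"xt;
# replaced with the two surrounding characters so that only the quote is dropped
pattern2 = re.compile(r'(\w)(\")(\w)')

# note: in this non-raw string literal Python turns \" into a plain ",
# so the first record actually reads ""*EET
data = '''
[{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''

data = pattern1.sub(r'"', data)
data = pattern2.sub(r'\1\3', data)

# load it into a pandas dataframe just to prove it is valid
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(data))

print(df)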

Output...

  WordText  Left  Top  Height  Width
0     *EET    88  153       7     21
1     4512     1   94       7     24
2      IVL    98  135       6     13

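It may also be worth looking at the extra-escaping link at the top of this answer to see whether that is where the problem comes from. This might be useful too...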
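For context, the kind of extra escaping that the linked title describes usually appears when already-serialized JSON text is serialized a second time. The sketch below only illustrates that general effect with the standard json module; whether it applies to your OCR pipeline is an assumption, and the record used here is made up from the example in the question.

```python
import json

record = {"WordText": '"*EET', "Left": 88.0}

encoded_once = json.dumps(record)         # ordinary JSON text
encoded_twice = json.dumps(encoded_once)  # a JSON string that wraps JSON text

# Decoding the double-encoded text once only yields a str, not a dict.
first_pass = json.loads(encoded_twice)
# A second decoding pass recovers the original record.
second_pass = json.loads(first_pass)

print(encoded_twice)
print(type(first_pass).__name__, second_pass == record)
```

If the OCR output turns out to be double-encoded like this, decoding twice is safer than stripping quotes with regular expressions.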

**Update:**

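Here is new code in which a corrupted sample JSON is repaired by the two regex patterns. I don't have your JSON, but it shows that regular expressions should help with the kinds of corruption described so far. I have commented the code to help explain it.

Code: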

import re
from io import StringIO

import pandas as pd

# compile a pattern to match "\"text" OR ""text", which needs replacing with a single double quote
pattern1 = re.compile(r'(\"\"|\"\\")')

# compile a second pattern to match a stray quote inside a word, as in "te"xt", which needs to be removed
pattern2 = re.compile(r'\b(\")\b')

# if this was the input (good_data) it would work without any clean up
good_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Management", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# copied good_data and corrupted it with "\"Erik", ""Gary", and "Mana"gement"
bad_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["\"Erik", "Daniel", "Michael", "Sven",
                ""Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Mana"gement", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# run the bad_data through the find and replace for the two patterns
# first one finds such mistakes as "\"text" OR ""text" and replaces them with a single double quote
bad_data = pattern1.sub(r'"', bad_data)

# second pattern finds a double quote on its own in the middle of a word like "te"xt" and removes it
bad_data = pattern2.sub(r'', bad_data)

# read the fixed bad_data into a pandas dataframe to check it's valid
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(bad_data))

# print out the df
print(df)

Output:

   Sub_ID       Name  Salary   StartDate  Department Sex
0       1       Erik  723.30    1/1/2011          IT   M
1       2     Daniel  515.20   7/23/2013  Management   M
2       3    Michael  621.00  12/15/2011          IT   M
3       4       Sven  731.00   6/11/2013          HR   M
4       5       Gary  844.15   3/27/2011     Finance   M
5       6      Carol  558.00   5/21/2012          IT   F
6       7       Lisa  642.80   7/30/2013  Management   F
7       8  Elisabeth  732.50   6/17/2014          IT   F

If you comment out the regex replacement lines...

bad_data = pattern1.sub(r'"', bad_data)
bad_data = pattern2.sub(r'', bad_data)

...and let pandas read the still-broken JSON, you get this error...

ValueError: Unexpected character found when decoding array value (2)

...which is expected.
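If pandas is not otherwise needed, the same two substitutions can be wrapped in a small helper and validated with the standard json module. This is a sketch under assumptions: the name repair_ocr_json is hypothetical, the sample records are shortened versions of the ones above, and the first pattern would also damage a legitimately empty string value (""), so it only suits data where empty values do not occur.

```python
import json
import re

# ""text or "\"text  ->  "text
DOUBLED_QUOTE = re.compile(r'""|"\\"')
# a quote sitting directly between two word characters, as in te"xt
EMBEDDED_QUOTE = re.compile(r'\b"\b')


def repair_ocr_json(text):
    """Apply both clean-up patterns, then parse; raises json.JSONDecodeError if still invalid."""
    text = DOUBLED_QUOTE.sub('"', text)
    text = EMBEDDED_QUOTE.sub('', text)
    return json.loads(text)


# Raw string, so the backslash in the first record is kept as-is.
raw = r'''
[{"WordText":"\"*EET", "Left":88.0},
 {"WordText":""4512","Left":1.0},
 {"WordText":"IV"L","Left":98.0}]
'''

records = repair_ocr_json(raw)
print([r["WordText"] for r in records])
```

Because json.loads raises on anything the patterns fail to repair, records that need manual attention surface as exceptions instead of being silently mangled.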


Tags: python, regex, ocr, json
