File contains \u00c2\u00a0, convert to characters

Question

I have a JSON file that contains text like this:

 .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

My simple question is: how do I convert (not remove) these \u codes into spaces, apostrophes, etc.?

Input: a text file containing .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...

Output: .....wax, and voila!(converted to the line break)At the moment you can't use our ...

Python code:

import requests

def TEST():
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text

    with open("TEST.json", 'w') as file:
        file.write(export.decode('utf8'))

What I have tried:

Using .json()
.encode().decode(), etc.

Edit 1

When I upload this file to BigQuery, I get the Â symbol.

A larger sample:

{
    "xxxx1": "...You don\u2019t nee...",
    "xxxx2": "...Gu\u00e9rer...",
    "xxxx3": "...boost.\u00a0Sit back an....",
    "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
    "xxxx5": "\u00a0\n\u00a0",
    "xxxx6": "It was Christmas Eve babe\u2026",
    "xxxx7": "It\u2019s xxx xxx\u2026"
}

Python code:

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))

    with open("TEST.json", "w") as file:
        json.dump(x,file)

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\u00e2\u0082\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)



if __name__ == '__main__':
    load()

Answer 1

A hacky approach is to strip the outer layer of encoding:

import re
# Assume export is a bytes-like object
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)

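This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) should produce a valid UTF-8 JSON file.

Of course, this will break if the file contains genuine escaped Unicode characters in that range, such as \u00e9 for an accented "e".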
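To make both the approach and its caveat concrete, here is a minimal sketch. The helper name unescape_utf8_bytes and the two sample byte strings are made up for illustration; the regex is the one shown above.

```python
import re

# Matches \u00XX where XX is a byte value >= 0x80 (same regex as above).
ESCAPED_HIGH_BYTE = re.compile(rb'\\u00([89a-f][0-9a-f])', re.IGNORECASE)

def unescape_utf8_bytes(raw):
    # Replace each escaped high byte with the single byte it stands for.
    return ESCAPED_HIGH_BYTE.sub(
        lambda m: bytes.fromhex(m.group(1).decode('ascii')), raw)

# Double-encoded input: the two escapes are really the UTF-8 bytes C2 A0.
mangled = rb'{"a": "voila!\u00c2\u00a0At the moment"}'
fixed = unescape_utf8_bytes(mangled)
print(fixed.decode('utf-8'))  # decodes cleanly, contains a real U+00A0

# Genuine escape: \u00e9 really means "é", so replacing it with the lone
# byte E9 leaves something that is no longer valid UTF-8.
genuine = rb'{"b": "Gu\u00e9rer"}'
broken = unescape_utf8_bytes(genuine)
try:
    broken.decode('utf-8')
except UnicodeDecodeError as exc:
    print("Not valid UTF-8 any more:", exc)
```

So this is only safe if you know that every \u00XX escape in the file is a mangled byte.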

Answer 2

I put together this crude UTF-8 unmangler, which seems to fix your messed-up encoding:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\u00e2\u0082\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)

Usage:

broken_json = r'''{"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}'''
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])

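It uses a regex to pick the hex sequences out of your string, converts them to individual bytes, and decodes those bytes as UTF-8.

For the sample string above (I have included the 3-byte character € as a test) this prints: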

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in "Parsed data" is just how Python prints dicts to the console; it is still an actual non-breaking space.

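The larger sample in the question mixes mangled sequences with genuine escapes such as \u00e9 and a lone \u00a0, which do not decode as UTF-8 on their own. One possible variation, sketched below, leaves such runs untouched instead of discarding them. The names ESCAPED_RUN and unmangle_or_keep and the sample string are assumptions for illustration, not part of the code above.

```python
import json
import re

# One or more consecutive \u00XX escapes in the still-escaped JSON text.
ESCAPED_RUN = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')

def unmangle_or_keep(match):
    escaped = match.group(0)                           # e.g. \u00c2\u00a0
    raw = bytes.fromhex(escaped.replace('\\u00', ''))  # e.g. b'\xc2\xa0'
    try:
        decoded = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not a UTF-8 byte sequence, so assume it is a genuine escape.
        return escaped
    # Re-escape through json.dumps so quotes, backslashes and control
    # characters cannot break the surrounding JSON text.
    return json.dumps(decoded, ensure_ascii=False)[1:-1]

sample = r'{"a": "Gu\u00e9rer", "b": "voila!\u00c2\u00a0At", "c": "\u00a0\n\u00a0"}'
data = json.loads(ESCAPED_RUN.sub(unmangle_or_keep, sample))
print(data)
```

This is still a heuristic: a run of genuine escapes that happens to form valid UTF-8 bytes would be converted as well, and a run that mixes mangled and genuine escapes is left alone as a whole.
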
Answer 3

Since you are trying to write this to a file named TEST.json, I will assume the string is part of a larger JSON string.

Let me use a complete example:

js = r'''{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

I would first load it with json:

import json

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

OK, this now looks like a UTF-8 string that was wrongly decoded as Latin-1. Let's do the reverse operation:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our

OK, that is fixed now, and we can convert it back to a proper JSON string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

which represents a correctly encoded NO-BREAK SPACE (U+00A0).
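If the goal is a file that holds the characters themselves rather than \u escapes, as the question asks, one option is to pass ensure_ascii=False and write the file as UTF-8. This is only a sketch: the temporary directory is there to keep the example self-contained, and I have not checked what BigQuery does with the result.

```python
import json
import os
import tempfile

# The repaired value from the step above.
x = {'a': "and voila!\u00a0At the moment you can't use our"}

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'TEST.json')

# ensure_ascii=False makes json.dump emit non-ASCII characters as-is;
# the explicit encoding makes sure they land on disk as UTF-8.
with open(path, 'w', encoding='utf-8') as f:
    json.dump(x, f, ensure_ascii=False)

with open(path, 'rb') as f:
    raw = f.read()
print(raw)  # the no-break space appears as the two bytes C2 A0
```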

TL;DR: what you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
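For the "identify the string values" step, a recursive walk over the parsed data is one way to do it. The sketch below assumes that only values (not keys) are affected, and that a string which cannot make the Latin-1/UTF-8 round trip was never mangled and should be left alone. The function name fix_mojibake and the sample request string are made up for illustration.

```python
import json

def fix_mojibake(value):
    """Recursively undo 'UTF-8 read as Latin-1' in every string value."""
    if isinstance(value, dict):
        return {key: fix_mojibake(item) for key, item in value.items()}
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, str):
        try:
            return value.encode('latin1').decode('utf8')
        except UnicodeError:
            # Either not representable in Latin-1 (e.g. U+2019) or not
            # valid UTF-8 afterwards (e.g. a lone "é"): keep it as it is.
            return value
    return value  # numbers, booleans, None

request = r'{"a": "voila!\u00c2\u00a0At", "b": ["Gu\u00e9rer", "don\u2019t"], "n": 1}'
fixed = fix_mojibake(json.loads(request))
print(json.dumps(fixed))
```

Note that the decision is made per string: a value that contains both mangled text and a genuine character outside Latin-1 is left unchanged as a whole.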


python

encode

decode

python-2.7

python-3.x