File contains \u00c2\u00a0, convert to characters
I have a JSON file that contains text like this:
.....wax, and voila!\u00c2\u00a0At the moment you can't use our ...
My simple question is how to CONVERT (not remove) these \u codes into spaces, apostrophes, etc.?
Input: a text file with .....wax, and voila!\u00c2\u00a0At the moment you can't use our ...
Output: .....wax, and voila!(converted to the line break)At the moment you can't use our ...
Python code:
def TEST():
    export = requests.get('https://sample.uk/', auth=('user', 'pass')).text
    with open("TEST.json", 'w') as file:
        file.write(export.decode('utf8'))
What I've tried:
- using .json()
- every combination of .encode().decode() and so on
Edit 1
When I upload this file to BigQuery, I get the Â symbol.
A bigger sample:
{
"xxxx1": "...You don\u2019t nee...",
"xxxx2": "...Gu\u00e9rer...",
"xxxx3": "...boost.\u00a0Sit back an....",
"xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
"xxxx5": "\u00a0\n\u00a0",
"xxxx6": "It was Christmas Eve babe\u2026",
"xxxx7": "It\u2019s xxx xxx\u2026"
}
Python code:
import json
import re
import codecs
def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))
    with open("TEST.json", "w") as file:
        json.dump(x, file)
def unmangle_utf8(match):
    escaped = match.group(0)                   # '\u00e2\u0082\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
if __name__ == '__main__':
    load()
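A hacky approach is to strip off the outer layer of encoding:
import re
# Assume export is a bytes-like object
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)
This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) should produce a valid UTF-8 JSON file.
Of course, this breaks if the file contains genuine escaped Unicode characters in the matched range (U+0080 to U+00FF), such as \u00e9 for an accented "e".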
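To make that caveat concrete, here is a minimal sketch of the same substitution applied to two made-up byte strings. The sample values and the helper name `strip_outer_escapes` are my own and are not taken from the export. The first input is double-encoded text, where the substitution works. The second contains a genuine \u00e9 escape, where the result is no longer valid UTF-8.

```python
import re

# Same pattern as above: an escaped byte in the range 0x80-0xFF.
ESCAPED_HIGH_BYTE = re.compile(rb'\\u00([89a-f][0-9a-f])', re.IGNORECASE)

def strip_outer_escapes(raw: bytes) -> bytes:
    """Replace each \\u00XX escape (XX >= 0x80) with the raw byte XX."""
    return ESCAPED_HIGH_BYTE.sub(lambda m: bytes.fromhex(m.group(1).decode()), raw)

mangled = rb'{"a": "voila!\u00c2\u00a0At"}'   # UTF-8 bytes escaped one by one
genuine = rb'{"b": "Gu\u00e9rer"}'            # a real escape for U+00E9

fixed = strip_outer_escapes(mangled)
print(fixed)

broken = strip_outer_escapes(genuine)
try:
    broken.decode('utf8')
    is_valid = True
except UnicodeDecodeError:
    # The lone byte 0xE9 is not a complete UTF-8 sequence.
    is_valid = False
print(is_valid)
```

So the approach is only safe when you know the file contains no legitimate escapes in that range.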
I made this crude UTF-8 unmangler, which appears to solve your messed-up encoding situation:
import codecs
import re
import json
def unmangle_utf8(match):
    escaped = match.group(0)                   # '\u00e2\u0082\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        return escaped                         # keep the escape as is instead of dropping it
Usage:
broken_json = '{"some_key": "... \\u00e2\\u0080\\u0099 w\\u0061x, and voila!\\u00c2\\u00a0\\u00c2\\u00a0At the moment you can\'t use our \\u00e2\\u0082\\u00ac ..."}'
print("Broken JSON\n", broken_json)
converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)
data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])
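It uses a regular expression to pick out the hex sequences from your string, converts them to individual bytes, and decodes them as UTF-8.
For the sample string above (I've included the 3-byte character € as a test) this prints: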
Broken JSON
{"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
{"some_key": "... ’ wax, and voila! At the moment you can't use our € ..."}
Parsed data
{'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
... ’ wax, and voila!  At the moment you can't use our € ...
The \xa0 in "Parsed data" is an artifact of how Python prints dicts to the console (it shows the repr of each string); it is still an actual non-breaking space.
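The difference between the two printouts can be checked directly. Here is a small sketch, using a made-up string `s`, that compares the `repr()` form Python uses inside a printed dict with the character actually stored:

```python
import unicodedata

s = "voila!\u00a0At"                 # contains a real NO-BREAK SPACE at index 6

as_repr = repr(s)                    # the form shown when the string sits inside a printed dict
char_name = unicodedata.name(s[6])   # official Unicode name of the stored character

print(as_repr)      # the escape \xa0 appears here
print(char_name)    # NO-BREAK SPACE
print(len(s))       # 9: the escape stands for one character
```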
As you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.
Let me use a full example:
js = r'''{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}'''
print(js)
{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}
I would first load it with json:
x = json.loads(js)
print(x)
{'a': "and voila!Â\xa0At the moment you can't use our"}
OK, this now looks like a UTF-8 string that was wrongly decoded as Latin-1. Let's do the reverse operation:
x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])
{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our
OK, it is fine now, and we can convert it back to a correct JSON string:
print(json.dumps(x))
{"a": "and voila!\u00a0At the moment you can't use our"}
which represents a correctly encoded NO-BREAK SPACE (U+00A0).
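For reference, the damage can be reproduced in a few lines, which also shows why the Latin-1 round trip undoes it. This is only an illustration: I am assuming the upstream system did something equivalent to the mangling step below, which the question does not confirm.

```python
import json

original = "voila!\u00a0At"          # text with a real NO-BREAK SPACE

# Mangling step: encode as UTF-8, then wrongly read those bytes as Latin-1.
mangled = original.encode('utf8').decode('latin1')

# Serialising the mangled text gives the escapes seen in the file.
as_json = json.dumps(mangled)
print(as_json)                       # "voila!\u00c2\u00a0At"

# Reversing the two steps restores the original text.
restored = mangled.encode('latin1').decode('utf8')
print(restored == original)          # True
```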
TL;DR: what you should do is:
# load the string as json:
js = json.loads(request)
# identify the string values in the json - you probably know how but I don't...
...
# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')
# convert back to a json string
request = json.dumps(js)
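The step left open above (finding the string values) could be handled with a small recursive walk. The following is a sketch rather than a definitive fix. The function names `fix_text` and `fix_strings` and the sample `request` are hypothetical. It assumes that any string which does not survive the Latin-1/UTF-8 round trip was correctly encoded in the first place and should be left alone.

```python
import json

def fix_text(text):
    """Undo UTF-8-read-as-Latin-1 damage; return text unchanged if that does not apply."""
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (e.g. a genuine \u2019 or \u00e9): keep the string as is.
        return text

def fix_strings(value):
    """Recursively apply fix_text to every string value in parsed JSON data."""
    if isinstance(value, str):
        return fix_text(value)
    if isinstance(value, list):
        return [fix_strings(item) for item in value]
    if isinstance(value, dict):
        # Keys are left untouched in this sketch; only values are repaired.
        return {key: fix_strings(item) for key, item in value.items()}
    return value  # numbers, booleans, None

# Hypothetical input mixing a mangled value with correctly escaped ones.
request = r'{"a": "voila!\u00c2\u00a0At", "b": ["Gu\u00e9rer", "don\u2019t"], "n": 1}'
js = fix_strings(json.loads(request))
request = json.dumps(js)
print(request)
```

The repair is applied per string, so a value that mixes mangled bytes with characters above U+00FF is left unchanged. In rare cases, correct text that happens to form a valid UTF-8 byte sequence could be altered, so it is worth spot-checking the result.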