Replace all Unicode codes with characters in Python
I have a text file that looks like this:
- l\u00f6yt\u00e4\u00e4
But all of the Unicode escape codes need to be replaced with the corresponding characters, so that it looks like this:
- löytää
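The problem is that I don't want to replace every Unicode code by hand. What is the most efficient way to do this automatically?
My code currently looks like this, but it really needs improving! (The code is Python 3.)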
import io

input = io.open("input.json", "r", encoding="utf-8")
output = io.open("output.txt", "w", encoding="utf-8")

with input, output:
    # Read input file.
    file = input.read()
    file = file.replace("\\u00e4", "ä")
    # I think last line is the same as line below:
    # file = file.replace("\\u00e4", u"\u00e4")
    file = file.replace("\\u00c4", "Ä")
    file = file.replace("\\u00f6", "ö")
    file = file.replace("\\u00d6", "Ö")
    # ...
    # I cannot put all codes in unicode here manually!
    # ...
    # writing output file
    output.write(file)
Simply decode the JSON as JSON, then write out a new JSON document without forcing the data to be ASCII-safe:
import json

with open("input.json", "r", encoding="utf-8") as input:
    with open("output.txt", "w", encoding="utf-8") as output:
        document = json.load(input)
        json.dump(document, output, ensure_ascii=False)
From the json.dump() documentation:
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Demo:
>>> import json
>>> print(json.loads(r'"l\u00f6yt\u00e4\u00e4"'))
löytää
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"')))
"l\u00f6yt\u00e4\u00e4"
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"'), ensure_ascii=False))
"löytää"
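The same round trip can be wrapped in a small helper that works on strings rather than files. This is only a minimal sketch: the name unescape_json_text is made up for illustration and is not part of the json module, and it assumes the input is one complete, valid JSON document.

```python
import json


def unescape_json_text(text):
    """Re-serialize a JSON document so non-ASCII characters are written literally.

    Hypothetical helper for illustration; assumes `text` is a single,
    valid JSON document.
    """
    return json.dumps(json.loads(text), ensure_ascii=False)


print(unescape_json_text(r'{"word": "l\u00f6yt\u00e4\u00e4"}'))
# prints {"word": "löytää"}
```

Note that re-serializing the document this way does not preserve the original whitespace or formatting of the file; only the data is kept.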
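If you have extremely large documents, you could still process them as text, line by line, but use a regular expression to do the replacements: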
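import re

# Match \uXXXX escapes that are not preceded by a backslash; a UTF-16
# surrogate pair (\uD8xx-\uDBxx followed by \uDCxx-\uDFxx) is matched as one unit.
unicode_escape = re.compile(
    r'(?<!\\)'
    r'(?:\\u([dD][89abAB][a-fA-F0-9]{2})\\u([dD][c-fC-F][a-fA-F0-9]{2})'
    r'|\\u([a-fA-F0-9]{4}))')

def replace(m):
    # Join the captured hex digits and decode them as big-endian UTF-16.
    return bytes.fromhex(''.join(m.groups(''))).decode('utf-16-be')

with open("input.json", "r", encoding="utf-8") as input:
    with open("output.txt", "w", encoding="utf-8") as output:
        for line in input:
            output.write(unicode_escape.sub(replace, line))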
However, this fails if your JSON has JSON documents embedded in strings, or if an escape sequence is preceded by an escaped backslash.
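To make the second limitation concrete, here is a rough sketch that applies the same pattern to three sample lines: a plain escape, a surrogate pair, and an escape that follows an escaped backslash. The sample strings and the unescape_line wrapper are invented for illustration.

```python
import json
import re

unicode_escape = re.compile(
    r'(?<!\\)'
    r'(?:\\u([dD][89abAB][a-fA-F0-9]{2})\\u([dD][c-fC-F][a-fA-F0-9]{2})'
    r'|\\u([a-fA-F0-9]{4}))')


def replace(m):
    # Join the captured hex digits and decode them as big-endian UTF-16.
    return bytes.fromhex(''.join(m.groups(''))).decode('utf-16-be')


def unescape_line(line):
    # Hypothetical wrapper around the substitution, for testing single lines.
    return unicode_escape.sub(replace, line)


# A plain escape and a surrogate pair are both converted.
print(unescape_line(r'"l\u00f6yt\u00e4\u00e4"'))   # prints "löytää"
print(unescape_line(r'"\ud83d\ude00"') == '"\U0001F600"')  # prints True

# An escaped backslash (\\) followed by a real \u00e4 escape: the lookbehind
# sees a backslash before \u00e4 and skips it, so the line is left unchanged,
# even though a JSON parser decodes it to a backslash followed by ä.
tricky = r'"a\\\u00e4"'
print(unescape_line(tricky) == tricky)        # prints True
print(json.loads(tricky) == 'a\\\u00e4')      # prints True
```

The skipped line is still valid JSON, so nothing is corrupted; the escape is simply left unconverted. If that matters, the json.load() and json.dump() approach handles it correctly.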