Fix a file encoded with the wrong charset
I have a file that is encoded in us-ascii, as the following command shows:
$ file -i /tmp/text
/tmp/text: text/plain; charset=us-ascii
However, it contains many Latin-1 encoded characters, for example:
Hij verblijft samen met zijn gezin in Belgi\xc3\xab
Activist Roger Espa\xc3\xb1ol raakte zijn oog kwijt door een politiekogel
I want to replace these wrong characters with the correct ones.
What I have tried:
$ iconv -f latin1 -t utf-8 text > text.1
with open("text") as f: text = f.read().encode("latin-1").decode("utf-8")
with open("text", "w") as f: f.write(text)
ftfy -e latin-1 text > text.1
and many variations of the above. Any help is appreciated.
Try this Python script:
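#!/usr/bin/env python3
import re

def convert(s):
    # Turn a literal "\xNN" escape into the single byte it denotes.
    return b'%c' % int(s.group(0)[2:], 16)

with open("text", 'rb') as f:
    # Match a literal backslash, "x", and two hex digits.
    text = re.sub(rb'\\x[0-9a-fA-F]{2}', convert, f.read())

with open("text", "wb") as f:
    f.write(text)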
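To see why the script helps where the earlier attempts did not, here is a minimal sketch that works on one sample line from the question instead of the file. It assumes the file holds literal backslash escapes (the four ASCII characters `\`, `x`, and two hex digits) that spell out UTF-8 bytes, which would explain why `file` reports us-ascii and why a Latin-1 to UTF-8 conversion changes nothing. The names `unescape_hex`, `sample`, `unchanged`, and `fixed` are made up for this illustration.

```python
import re


# Hypothetical helper name; it is not part of the script above.
def unescape_hex(data: bytes) -> bytes:
    # Replace each literal backslash-x-NN escape with the single byte 0xNN.
    return re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


# Sample line copied from the question; the backslashes are literal characters.
sample = rb"Hij verblijft samen met zijn gezin in Belgi\xc3\xab"

# The input is pure ASCII, so a Latin-1 to UTF-8 round trip leaves it unchanged.
unchanged = sample.decode("latin-1").encode("utf-8")

# After unescaping, the bytes form valid UTF-8 and decode to the intended text.
fixed = unescape_hex(sample).decode("utf-8")
print(fixed)
```

If the decoded line ends in "België", the escapes were UTF-8 rather than Latin-1, and the rewritten file should then be read as UTF-8.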