Convert angle-bracketed embedded UTF-8 hex values to accented characters

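With pandas read_json, UTF-8 accented or diacritic characters come through as angle-bracketed hex values, one bracket pair per byte. How can I avoid or fix this so that the actual UTF-8 characters are rendered?

Consider the example below, pulled from an S3 bucket listing all current R GitHub CRAN packages. Note that the accented characters in the output are represented by hex values in angle brackets: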

import pandas as pd

# S3 (REQUIRES s3fs)
json_df = pd.read_json("s3://public-r-data/ghcran.json")

# URL (NO REQUIREMENT)
json_df = pd.read_json("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json")

json_df.loc[884, "Title"]
# Misc Functions of Eduard Sz<c3><b6>cs

json_df.loc[213, "Author"]
# Kirill M<c3><bc>ller [aut, cre]

json_df.loc[336, "Maintainer"]
# H<c3><a9>l<c3><a8>ne Morlon <morlon@biologie.ens.fr>

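I would like to avoid a solution that replaces these hex values through a mapping dictionary. Is there a compact way to avoid such embedded hex values, or to convert them to the actual UTF-8 characters? Specifically, how can the results above be turned into: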

# Misc Functions of Eduard Szöcs

# Kirill Müller [aut, cre]

# Hélène Morlon <morlon@biologie.ens.fr>

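Ah... the pain of wrongly encoded strings! I once had a similar problem, and here is an adapted version of how I solved it.

It breaks the original string into an array of bytes, replaces each four-byte group of the form <xx> with the single byte it names, and then decodes the new array as UTF-8. It is far from foolproof (for example, it only recognizes lowercase hex digits and would also convert a literal <ab> in the text), but that may not be a problem depending on your requirements: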

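def reencode(string):
    """Replace each <xx> hex escape with the byte it names, then decode as UTF-8."""

    def is_hex(i):
        # ASCII code of a digit 0-9 or a lowercase letter a-f
        return (48 <= i <= 57) or (97 <= i <= 102)

    def to_bytes(arr):
        # Expect exactly four bytes of the form b'<xx>'
        if len(arr) != 4:
            return None

        a, b, c, d = arr
        # 60 is '<' and 62 is '>'
        if a == 60 and is_hex(b) and is_hex(c) and d == 62:
            return bytes.fromhex(chr(b) + chr(c))

        return None

    # Encode as UTF-8 rather than ASCII so that characters which are
    # already correct do not raise UnicodeEncodeError
    old = string.encode('utf8')
    new = bytearray()

    i = 0
    while i < len(old):
        b = to_bytes(old[i:i+4])
        if b:
            # Matched an escape: emit the decoded byte and skip past it
            new.extend(b)
            i += 4
        else:
            # Ordinary byte: copy it through unchanged
            new.append(old[i])
            i += 1
    return new.decode('utf8')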
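
To see why the bytes have to be grouped before decoding, here is a small sketch using only the standard library. It takes the two escapes from the first example, <c3><b6>, and shows that together they form the UTF-8 encoding of ö, while decoding each byte separately (Latin-1 is used here purely for illustration) gives two unrelated characters:

```python
# The two escapes <c3><b6> name the bytes 0xC3 and 0xB6.
pair = bytes.fromhex("c3" + "b6")

# Decoded together as UTF-8, the two bytes form a single character.
decoded = pair.decode("utf-8")
print(decoded)  # ö

# Decoded one byte at a time (as Latin-1), they give two unrelated characters.
separate = bytes.fromhex("c3").decode("latin-1") + bytes.fromhex("b6").decode("latin-1")
print(separate)  # Ã¶
```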

Test:

df = pd.DataFrame([
    'Misc Functions of Eduard Sz<c3><b6>cs',
    'Kirill M<c3><bc>ller [aut, cre]',
    'H<c3><a9>l<c3><a8>ne Morlon <morlon@biologie.ens.fr>'
], columns=['name'])
df['name'].apply(reencode)

Output:

0            Misc Functions of Eduard Szöcs
1                  Kirill Müller [aut, cre]
2    Hélène Morlon <morlon@biologie.ens.fr>
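
If a more compact form is preferred, the same byte-level idea can probably be expressed with a regular expression. The following is only a sketch, not a tested drop-in replacement: the name fix_hex_escapes is made up for this example, it assumes every <xx> group with exactly two hex digits is an escaped byte, and it uses errors="replace" so that an invalid byte sequence yields U+FFFD instead of raising.

```python
import re

# Matches an escaped byte such as b"<c3>": exactly two hex digits in angle brackets.
HEX_ESCAPE = re.compile(rb"<([0-9a-fA-F]{2})>")


def fix_hex_escapes(text):
    """Hypothetical helper: turn <xx> escapes back into bytes, then decode as UTF-8."""
    raw = text.encode("utf-8")
    # Replace each escape with the single byte it names.
    raw = HEX_ESCAPE.sub(lambda m: bytes.fromhex(m.group(1).decode("ascii")), raw)
    # Assumption: invalid sequences should be replaced rather than raise an error.
    return raw.decode("utf-8", errors="replace")


print(fix_hex_escapes("Kirill M<c3><bc>ller [aut, cre]"))
print(fix_hex_escapes("H<c3><a9>l<c3><a8>ne Morlon <morlon@biologie.ens.fr>"))
```

It can be applied to a column the same way as above, with df['name'].apply(fix_hex_escapes). The e-mail address in angle brackets is left alone because it does not consist of exactly two hex digits, but a literal <ab> in the data would still be converted, so the same caveats apply.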