是否可以将字节解码为 UTF-8，将错误转换为 Rust 中的转义序列？

Question

在 Rust 中，可以通过这样做从字节中获取 UTF-8：

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

这要么有效要么无效，但是 Python 具有处理错误的能力，例如：

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');

在此示例中，参数 surrogateescape 将无效的 utf-8 序列转换为转义码，因此不会忽略或替换无法解码的文本，而是将它们替换为字节文字表达式，有效 utf-8。参见：Python docs 了解详情。

Rust 是否有办法从字节中获取 UTF-8 字符串，从而避免错误而不是完全失败？

Answer 1

是的，通过 String::from_utf8_lossy:

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}

如果您需要对流程进行更多控制，可以使用std::str::from_utf8, as suggested by the 。但是，没有理由像它建议的那样对字节进行双重验证。

一个快速破解的例子：

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

您可以扩展它以取值来替换坏字节，一个闭包来处理坏字节等。例如：

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo

另请参阅：

How do I convert a Vector of bytes (u8) to a string
.

Answer 2

您可以：

使用strict UTF-8 decoding自己构造，其中returns一个错误，表示解码失败的位置，然后可以转义。但这是低效的，因为您将对每次失败的尝试进行两次解码。
试试 3rd party crates，它提供更多可自定义的字符集解码器。

是否可以将字节解码为 UTF-8，将错误转换为 Rust 中的转义序列？

Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

utf-8

rust

unicode-escapes