How can I create an efficient iterator of chars from stdin with Rust?

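Now that the Read::chars iterator has been officially deprecated, what is the proper way to obtain an iterator over the chars coming from a Reader (such as stdin) without reading the entire stream into memory?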

The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers suggestions:

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encoding need to be supported, consider using the encoding_rs or encoding crate.
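
To make the suggestion above more concrete, here is a minimal sketch of chunked decoding built on str::from_utf8, valid_up_to and error_len. The helper name for_each_char, its callback-based signature and the chunk-size parameter are assumptions made for illustration; they do not come from the issue or from any crate. Bytes of a char that is split across two reads are carried over to the next iteration.

```rust
use std::io::{self, Read};
use std::str;

/// Reads `rdr` in chunks of `chunk` bytes and calls `f` for every decoded char.
/// Hypothetical helper: the name, signature and buffering policy are illustrative.
fn for_each_char<R: Read>(mut rdr: R, chunk: usize, mut f: impl FnMut(char)) -> io::Result<()> {
    assert!(chunk >= 4, "buffer must be able to hold one full UTF-8 sequence");
    let mut buf = vec![0u8; chunk];
    let mut len = 0; // number of bytes in `buf` that still await decoding

    loop {
        let n = match rdr.read(&mut buf[len..]) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        len += n;

        // Find the longest valid UTF-8 prefix of what we have so far.
        let (valid, bad) = match str::from_utf8(&buf[..len]) {
            Ok(s) => (s.len(), None),
            Err(e) => (e.valid_up_to(), Some(e)),
        };
        // The prefix was just validated, so this second check cannot fail.
        for c in str::from_utf8(&buf[..valid]).expect("validated prefix").chars() {
            f(c);
        }

        if let Some(e) = bad {
            // error_len() == None means the data merely ends inside a char.
            // That is only an error if the reader has nothing more to give.
            if e.error_len().is_some() || n == 0 {
                return Err(io::Error::new(io::ErrorKind::InvalidData, e));
            }
        }
        if n == 0 {
            return Ok(()); // end of input, nothing left over
        }

        // Move the incomplete tail (at most 3 bytes) to the front of the buffer.
        buf.copy_within(valid..len, 0);
        len -= valid;
    }
}
```

A call such as for_each_char(io::stdin().lock(), 8192, |c| println!(">{}<", c)) would then process stdin without holding more than one chunk in memory. Working on whole &str slices instead of single chars, as the issue recommends, would only require changing what is passed to the callback.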

Your own iterator

The most efficient solution in terms of the number of I/O calls is to read everything into one giant String buffer and iterate over that:

use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}

You can combine this with an answer that supplies an owning into_chars iterator for String:

use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(s.into_chars()) // owning iterator taken from that answer
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();

    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }

    Ok(())
}

We now have a function that returns an iterator of chars for any type that implements Read.
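
The into_chars call used here may not be available on a stable toolchain, so it has to come from somewhere. As a rough sketch of what such an owning iterator could look like, the following defines a hypothetical OwnedChars type and owned_chars function (both names are invented for this illustration, and this is not necessarily how the referenced answer implements it):

```rust
/// Owns a String and yields its chars one by one.
/// Hypothetical stand-in for an owning version of `str::chars`.
struct OwnedChars {
    s: String,
    pos: usize, // byte offset of the next char; always on a char boundary
}

impl Iterator for OwnedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Decode only the next char from the remaining slice.
        let c = self.s[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

/// Hypothetical constructor; consumes the String so the iterator borrows nothing.
fn owned_chars(s: String) -> OwnedChars {
    OwnedChars { s, pos: 0 }
}
```

Because the iterator owns its String, it can be returned from a function such as reader_chars without lifetime problems, which is the whole reason a plain s.chars() does not work there.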

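Once you have this pattern, it is just a matter of deciding where to make the trade-off between memory allocation and I/O requests. Here is a similar idea that uses line-sized buffers: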

use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);

    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars()) // owning iterator from the answer mentioned above
}

fn main() {
    // emoji are 4 bytes each
    let data = "😀😁😂";
    let data = data.as_bytes();

    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}

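The extreme end would be to perform one I/O request for every character. This would not take much memory, but it would incur a lot of I/O overhead.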

A pragmatic answer

Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.

See also:

  • How do you iterate over a string by character

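As a couple of others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is truly ideal will depend on your use case. For me, it proved to be good enough for now, although my application will likely outgrow this approach in the near future.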

To illustrate how this can be done, let's look at a concrete example:

use std::io::{self, Error, ErrorKind, Read};
use std::result;
use std::str;

struct MyReader<R> {
    inner: R,
}

impl<R: Read> MyReader<R> {
    fn new(inner: R) -> MyReader<R> {
        MyReader {
            inner,
        }
    }
}

#[derive(Debug)]
enum MyReaderError {
    NotUtf8,
    Other(Error),
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = result::Result<char, MyReaderError>;

    fn next(&mut self) -> Option<result::Result<char, MyReaderError>> {
        let first_byte = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(MyReaderError::Other(e))),
        };
        let width = utf8_char_width(first_byte);
        if width == 1 {
            return Some(Ok(first_byte as char));
        }
        if width == 0 {
            return Some(Err(MyReaderError::NotUtf8));
        }
        let mut buf = [first_byte, 0, 0, 0];
        {
            let mut start = 1;
            while start < width {
                match self.inner.read(&mut buf[start..width]) {
                    Ok(0) => return Some(Err(MyReaderError::NotUtf8)),
                    Ok(n) => start += n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Some(Err(MyReaderError::Other(e))),
                }
            }
        }
        Some(match str::from_utf8(&buf[..width]).ok() {
            Some(s) => Ok(s.chars().next().unwrap()),
            None => Err(MyReaderError::NotUtf8),
        })
    }
}

The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:

static UTF8_CHAR_WIDTH: [u8; 256] = [
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x1F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x3F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x5F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x7F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x9F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0xBF
0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // 0xDF
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, // 0xEF
4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, // 0xFF
];

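// Width in bytes of the UTF-8 sequence that starts with `b` (0 if `b` cannot start one).
fn utf8_char_width(b: u8) -> usize {
    UTF8_CHAR_WIDTH[b as usize] as usize
}

// Reads a single byte; returns None at end of input and retries on Interrupted.
fn read_one_byte(reader: &mut dyn Read) -> Option<io::Result<u8>> {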
    let mut buf = [0];
    loop {
        return match reader.read(&mut buf) {
            Ok(0) => None,
            Ok(..) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}
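
For readers who find the 256-entry table hard to verify by eye, the same mapping can be written as a match on byte ranges. This is only an equivalent sketch for illustration, with a made-up function name (utf8_width_by_range); the copied implementation uses the table.

```rust
/// Width of the UTF-8 sequence introduced by lead byte `b`, or 0 if `b`
/// cannot start a sequence. Hypothetical alternative to the lookup table.
fn utf8_width_by_range(b: u8) -> usize {
    match b {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2, // 0xC0 and 0xC1 would only encode overlong forms
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4, // 0xF5 and above would exceed U+10FFFF
        _ => 0,           // continuation bytes (0x80..=0xBF) and invalid lead bytes
    }
}
```

A width of 0 is what makes the iterator report NotUtf8 when the first byte it reads is a continuation byte or an invalid lead byte.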

Now we can use the MyReader implementation to produce an iterator of chars over some reader, such as io::Stdin:

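fn main() {
    let stdin = io::stdin();
    let reader = MyReader::new(stdin.lock());
    for c in reader {
        // Each item is a Result, so handle decoding and I/O errors explicitly
        match c {
            Ok(c) => println!("{}", c),
            Err(e) => {
                eprintln!("error: {:?}", e);
                break;
            }
        }
    }
}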

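The limitations of this approach are discussed in detail in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not correctly handle streams that are not UTF-8 encoded.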