How can I create an efficient iterator of chars from stdin with Rust?
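Now that the Read::chars iterator has been officially deprecated, what is the proper way to obtain an iterator over the chars coming from a Reader (such as stdin) without reading the entire stream into memory?
The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers a suggestion: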
Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.
Of course the above is for data that is always UTF-8. If other character encoding need to be supported, consider using the encoding_rs or encoding crate.
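The strategy described in the quote can be sketched roughly as follows. This is only a minimal illustration, not the utf-8 crate's API and not code from the linked issue: the names for_each_str, collect_chars and invalid_data are made up for this example, and the sketch validates each valid prefix twice for simplicity. It reads fixed-size chunks, hands every valid &str run to a callback, and carries an incomplete trailing sequence (the case where error_len returns None) over to the next read.

```rust
use std::io::{self, Read};
use std::str;

// Hypothetical helper: builds an InvalidData error with the given message.
fn invalid_data(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

// Hypothetical helper: reads `rdr` in chunks of `chunk_size` bytes and calls
// `on_str` with each run of valid UTF-8. Bytes of a char that is split across
// two reads are kept at the front of the buffer until the rest arrives.
fn for_each_str<R: Read>(
    mut rdr: R,
    chunk_size: usize,
    mut on_str: impl FnMut(&str),
) -> io::Result<()> {
    assert!(chunk_size > 0);
    // An incomplete sequence is at most 3 bytes long, so 3 spare bytes suffice.
    let mut buf = vec![0u8; chunk_size + 3];
    let mut pending = 0; // carried-over bytes at the start of `buf`
    loop {
        let n = match rdr.read(&mut buf[pending..pending + chunk_size]) {
            Ok(n) => n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes mean a truncated character.
            return if pending == 0 {
                Ok(())
            } else {
                Err(invalid_data("incomplete UTF-8 sequence at end of input"))
            };
        }
        let filled = pending + n;
        let valid = match str::from_utf8(&buf[..filled]) {
            Ok(_) => filled,
            // error_len() == None: the input ended in the middle of a char.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // error_len() == Some(_): the bytes are definitely invalid.
            Err(_) => return Err(invalid_data("invalid UTF-8 sequence")),
        };
        let s = str::from_utf8(&buf[..valid]).expect("prefix was validated above");
        if !s.is_empty() {
            on_str(s);
        }
        // Move the incomplete tail to the front for the next iteration.
        buf.copy_within(valid..filled, 0);
        pending = filled - valid;
    }
}

// Convenience wrapper for experimenting: collects every char into a Vec.
fn collect_chars(data: &[u8], chunk_size: usize) -> io::Result<Vec<char>> {
    let mut out = Vec::new();
    for_each_str(data, chunk_size, |s| out.extend(s.chars()))?;
    Ok(out)
}
```

Calling collect_chars on the same input with different chunk sizes should give the same chars, which is a quick way to check that characters split across chunk boundaries are reassembled.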
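Your own iterator
The most efficient solution in terms of the number of I/O calls is to read everything into one giant String buffer and iterate over that: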
use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}
You can combine this with an answer that provides an owning into_chars iterator over a String:
use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(s.into_chars()) // into_chars is the helper from the referenced answer
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }
    Ok(())
}
We now have a function that returns an iterator of char for any type that implements Read.
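The code above relies on an into_chars helper that is defined elsewhere. For readers who want something self-contained, here is one possible sketch of such an owning iterator. The names OwnedChars and owned_chars are invented for this example (a free function is used so that it cannot clash with any method on String), and this is not necessarily how the referenced answer implements it.

```rust
use std::io::{self, Read};

// Hypothetical helper type: owns a String and yields its chars one by one.
struct OwnedChars {
    s: String,
    pos: usize, // byte offset of the next char; always on a char boundary
}

impl Iterator for OwnedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Decode the next char starting at `pos`, then step past it.
        let c = self.s[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

// Hypothetical constructor: turns a String into an owning char iterator.
fn owned_chars(s: String) -> OwnedChars {
    OwnedChars { s, pos: 0 }
}

// Same shape as the function above, but built on the sketch instead of
// the into_chars helper.
fn reader_chars<R: Read>(mut rdr: R) -> io::Result<OwnedChars> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(owned_chars(s))
}
```

Because the iterator owns the String, it can be returned from a function without borrowing from a local variable, which is what s.chars() alone cannot do.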
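Once you have this pattern, it is just a matter of deciding where to make the tradeoff between memory allocation and I/O requests. Here is a similar idea that uses line-sized buffers: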
use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);
    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars()) // into_chars is the helper from the referenced answer
}

fn main() {
    // emoji are 4 bytes each; these escapes stand in for any 4-byte characters
    let data = "\u{1F37B}\u{1F37B}\u{1F37B}";
    let data = data.as_bytes();
    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}
The far extreme would be to perform one I/O request for every character. This would not take much memory, but it would have a lot of I/O overhead.
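To make that tradeoff concrete, the following sketch counts how many read calls reach the underlying reader when bytes are pulled one at a time, with and without a BufReader in between. CountingReader and the two count_reads_* functions are hypothetical names introduced only for this illustration, and an in-memory slice stands in for stdin.

```rust
use std::io::{self, BufReader, Read};

// Hypothetical wrapper: counts how often `read` is called on the inner reader.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

// Pulls one byte per request straight from the source.
fn count_reads_unbuffered(data: &[u8]) -> usize {
    let mut counter = CountingReader { inner: data, calls: 0 };
    let mut byte = [0u8; 1];
    while counter.read(&mut byte).expect("in-memory read failed") > 0 {}
    counter.calls
}

// Pulls one byte per request, but through a BufReader of the given capacity,
// so the source is only touched when the buffer needs refilling.
fn count_reads_buffered(data: &[u8], capacity: usize) -> usize {
    let counter = CountingReader { inner: data, calls: 0 };
    let mut buffered = BufReader::with_capacity(capacity, counter);
    let mut byte = [0u8; 1];
    while buffered.read(&mut byte).expect("in-memory read failed") > 0 {}
    buffered.into_inner().calls
}
```

With a real stdin each of those calls may be a system call, so the gap between the two counts is a rough proxy for the I/O overhead mentioned above.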
A pragmatic answer
Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.
See also:
- How do you iterate over a string by character
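As some others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is truly ideal will depend on your use case; for me, it proved good enough for now, although it is likely that my application will move away from this approach in the near future.
To illustrate how this can be done, let's look at a concrete example: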
use std::io::{self, Error, ErrorKind, Read};
use std::result;
use std::str;

struct MyReader<R> {
    inner: R,
}

impl<R: Read> MyReader<R> {
    fn new(inner: R) -> MyReader<R> {
        MyReader { inner }
    }
}

#[derive(Debug)]
enum MyReaderError {
    NotUtf8,
    Other(Error),
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = result::Result<char, MyReaderError>;

    fn next(&mut self) -> Option<result::Result<char, MyReaderError>> {
        // `?` ends the iteration when read_one_byte reports end of input.
        let first_byte = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(MyReaderError::Other(e))),
        };
        let width = utf8_char_width(first_byte);
        if width == 1 {
            return Some(Ok(first_byte as char));
        }
        if width == 0 {
            return Some(Err(MyReaderError::NotUtf8));
        }
        // Read the remaining continuation bytes of this character.
        let mut buf = [first_byte, 0, 0, 0];
        {
            let mut start = 1;
            while start < width {
                match self.inner.read(&mut buf[start..width]) {
                    Ok(0) => return Some(Err(MyReaderError::NotUtf8)),
                    Ok(n) => start += n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Some(Err(MyReaderError::Other(e))),
                }
            }
        }
        Some(match str::from_utf8(&buf[..width]).ok() {
            Some(s) => Ok(s.chars().next().unwrap()),
            None => Err(MyReaderError::NotUtf8),
        })
    }
}
The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:
static UTF8_CHAR_WIDTH: [u8; 256] = [
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x1F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x3F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x5F
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // 0x7F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0x9F
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 0xBF
0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // 0xDF
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, // 0xEF
4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, // 0xFF
];
fn utf8_char_width(b: u8) -> usize {
    UTF8_CHAR_WIDTH[b as usize] as usize
}

// Returns None at end of input, retrying reads that were interrupted.
fn read_one_byte(reader: &mut dyn Read) -> Option<io::Result<u8>> {
    let mut buf = [0];
    loop {
        return match reader.read(&mut buf) {
            Ok(0) => None,
            Ok(..) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}
Now we can use the MyReader implementation to produce an iterator of chars over some reader, such as a locked io::Stdin:
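fn main() {
    let stdin = io::stdin();
    let reader = MyReader::new(stdin.lock());
    for c in reader {
        // Each item is a Result, so handle the error case explicitly.
        match c {
            Ok(c) => println!("{}", c),
            Err(e) => {
                eprintln!("error: {:?}", e);
                break;
            }
        }
    }
}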
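The limitations of this approach are discussed at length in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not correctly handle streams that are not UTF-8 encoded.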