Borrow checker doesn't realize that `clear` drops reference to local variable

Question

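The following code reads space-delimited records from stdin and writes comma-delimited records to stdout. Even with an optimized build it is rather slow (about twice as slow as doing the same thing with awk).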

use std::io::BufRead;

fn main() {
    let stdin = std::io::stdin();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        let fields: Vec<_> = line.split(' ').collect();
        println!("{}", fields.join(","));
    }
}

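One obvious improvement would be to use itertools to join without allocating a vector (the collect call causes an allocation). However, I tried a different approach: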

use std::io::BufRead;

fn main() {
    let stdin = std::io::stdin();
    let mut cache = Vec::<&str>::new();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        cache.extend(line.split(' '));
        println!("{}", cache.join(","));
        cache.clear();
    }
}

This version tries to reuse the same vector over and over. Unfortunately, the compiler complains:

error: `line` does not live long enough
 --> src/main.rs:7:22
  |
7 |         cache.extend(line.split(' '));
  |                      ^^^^
  |
note: reference must be valid for the block suffix following statement 1 at 5:39...
 --> src/main.rs:5:40
  |
5 |     let mut cache = Vec::<&str>::new();
  |                                        ^
note: ...but borrowed value is only valid for the for at 6:4
 --> src/main.rs:6:5
  |
6 |     for line in stdin.lock().lines().map(|x| x.unwrap()) {
  |     ^

error: aborting due to previous error

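This of course makes sense: the line variable is only alive in the body of the for loop, whereas cache keeps a pointer into it across iterations. But the error still looks spurious to me: since the cache is cleared after each iteration, no reference to line can be kept, right?

How can I tell the borrow checker about this?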

Answer 1

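The only way to do this is to use transmute to change the Vec<&'a str> into a Vec<&'b str>. transmute is unsafe, and Rust will not raise an error if you forget the call to clear here. You might want to extend the unsafe block to just after the call to clear, to make it clear (no pun intended) where the code returns to "safe land".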

use std::io::BufRead;
use std::mem;

fn main() {
    let stdin = std::io::stdin();
    let mut cache = Vec::<&str>::new();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        let cache: &mut Vec<&str> = unsafe { mem::transmute(&mut cache) };
        cache.extend(line.split(' '));
        println!("{}", cache.join(","));
        cache.clear();
    }
}
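For illustration, the suggestion to stretch the unsafe block past the call to clear might look roughly like this. This is only a sketch: `convert` is a hypothetical helper name, and taking a generic reader and writer instead of stdin and stdout is my own assumption so that the function can be exercised in a test.

```rust
use std::io::{self, BufRead, Write};
use std::mem;

/// Hypothetical helper (not from the answer above): copies space-separated
/// lines from `input` to `output` as comma-separated lines.
fn convert<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
    let mut cache = Vec::<&str>::new();
    for line in input.lines() {
        let line = line?;
        // The unsafe block covers everything up to and including `clear`,
        // so the lifetime-erased view of `cache` never leaves it.
        let result = unsafe {
            // SAFETY: only the lifetime parameter changes, and `cache` is
            // emptied again before `line` is dropped.
            let cache: &mut Vec<&str> = mem::transmute(&mut cache);
            cache.extend(line.split(' '));
            let result = writeln!(output, "{}", cache.join(","));
            cache.clear();
            result
        };
        // Propagate write errors only after the cache has been cleared.
        result?;
    }
    Ok(())
}
```

The write result is deliberately checked after the block, so that an early return cannot skip the clear.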

Answer 2

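Rust doesn't know what you're trying to do in this case. Unfortunately, .clear() does not affect how .extend() is checked.

cache is a "vector of strings that live as long as the main function", but in the extend() call you are appending "strings that live only as long as one loop iteration", so that is a type mismatch. Calling .clear() doesn't change the types.

Usually such limited-time uses are expressed by making a long-lived opaque object that grants access to its memory by lending out a temporary object with the right lifetime, the way RefCell.borrow() gives a temporary Ref object. Implementing that would be a bit involved, and it would require unsafe code to recycle the Vec's internal memory.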
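A rough sketch of what such an object could look like, under my own assumptions: `FieldCache` and `Fields` are hypothetical names, and the only unsafe step is handing the emptied buffer back when the guard is dropped. Treat it as an illustration of the idea rather than a vetted implementation.

```rust
use std::ops::{Deref, DerefMut};

/// Hypothetical long-lived owner of the reusable allocation.
pub struct FieldCache {
    // Invariant: always empty, so it never holds a real `&'static str`.
    storage: Vec<&'static str>,
}

/// Hypothetical temporary guard; `'a` is chosen anew for every borrow.
pub struct Fields<'c, 'a> {
    owner: &'c mut FieldCache,
    items: Vec<&'a str>,
}

impl FieldCache {
    pub fn new() -> Self {
        FieldCache { storage: Vec::new() }
    }

    pub fn capacity(&self) -> usize {
        self.storage.capacity()
    }

    pub fn borrow<'c, 'a>(&'c mut self) -> Fields<'c, 'a> {
        // Safe direction: an (empty) Vec<&'static str> is a Vec<&'a str>.
        let items: Vec<&'a str> = std::mem::take(&mut self.storage);
        Fields { owner: self, items }
    }
}

impl<'c, 'a> Deref for Fields<'c, 'a> {
    type Target = Vec<&'a str>;
    fn deref(&self) -> &Self::Target {
        &self.items
    }
}

impl<'c, 'a> DerefMut for Fields<'c, 'a> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.items
    }
}

impl<'c, 'a> Drop for Fields<'c, 'a> {
    fn drop(&mut self) {
        let mut items = std::mem::take(&mut self.items);
        items.clear();
        // SAFETY: `items` is empty, so no `&'a str` survives the change of
        // the lifetime parameter; the allocation itself is unaffected.
        self.owner.storage =
            unsafe { std::mem::transmute::<Vec<&'a str>, Vec<&'static str>>(items) };
    }
}
```

Inside the loop one would then write `let mut fields = cache.borrow();`, extend it from `line.split(' ')`, and let the guard go out of scope at the end of the iteration. If the guard is leaked instead of dropped, the allocation is simply lost and a fresh one is made later.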

In this case, an alternative solution could be to avoid allocations entirely (.join() allocates too) and stream the printing through a Peekable iterator wrapper:

for line in stdin.lock().lines().map(|x| x.unwrap()) {
    let mut fields = line.split(' ').peekable();
    while let Some(field) = fields.next() {
        print!("{}", field);
        if fields.peek().is_some() {
            print!(",");
        }
    }
    println!();
}
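A self-contained variant of the same streaming idea, as a sketch. `stream_fields` is a hypothetical name, and writing through a BufWriter rather than print! is my own assumption, not something the snippet above does. Note that lines() still allocates a String for each line.

```rust
use std::io::{self, BufRead, BufWriter, Write};

/// Hypothetical helper: streams each line of `input` to `output`,
/// replacing the separating spaces with commas, without building a Vec
/// or a joined String.
fn stream_fields<R: BufRead, W: Write>(input: R, output: W) -> io::Result<()> {
    // Buffer the output so that each field is not a separate write
    // to the underlying sink.
    let mut out = BufWriter::new(output);
    for line in input.lines() {
        let line = line?;
        let mut fields = line.split(' ').peekable();
        while let Some(field) = fields.next() {
            out.write_all(field.as_bytes())?;
            // Only emit a comma when another field follows.
            if fields.peek().is_some() {
                out.write_all(b",")?;
            }
        }
        out.write_all(b"\n")?;
    }
    out.flush()
}
```

It could be called with locked stdin and stdout handles in place of the loop above.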

BTW: Francis's answer with transmute is good too. You can use unsafe to say that you know what you're doing and override the lifetime check.

Answer 3

Itertools has .format() for the purpose of lazy formatting, which also skips allocating a string.

use std::io::BufRead;
use itertools::Itertools;

fn main() {
    let stdin = std::io::stdin();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        println!("{}", line.split(' ').format(","));
    }
}
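If pulling in a crate is not an option, the same lazy-formatting idea can be sketched with the standard library alone. `CommaJoined` is a hypothetical wrapper of my own; it is not how itertools implements .format(), just the same principle of doing the joining inside Display.

```rust
use std::fmt;

/// Hypothetical wrapper: displays a space-separated line with commas
/// between the fields, without building an intermediate String.
struct CommaJoined<'a>(&'a str);

impl fmt::Display for CommaJoined<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut fields = self.0.split(' ');
        // Write the first field bare, then every later field with a
        // leading comma.
        if let Some(first) = fields.next() {
            f.write_str(first)?;
            for field in fields {
                f.write_str(",")?;
                f.write_str(field)?;
            }
        }
        Ok(())
    }
}
```

With this in place, the loop body would become `println!("{}", CommaJoined(&line));`.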

As an aside, regarding the solution in another answer here: something like this is a "safe abstraction" in the most minimal sense:

use std::mem::transmute;

fn repurpose<'a, T: ?Sized>(mut v: Vec<&T>) -> Vec<&'a T> {
    v.clear();
    unsafe {
        transmute(v)
    }
}

Answer 4

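A safe solution is to use .drain(..) instead of .clear(), where .. is the "full range". It returns an iterator, so the drained elements can be processed in a loop. It is also available for other collections (String, HashMap, etc.).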

fn main() {
    let mut cache = Vec::<&str>::new();
    for line in ["first line allocates for", "second"].iter() {
        println!("Size and capacity: {}/{}", cache.len(), cache.capacity());
        cache.extend(line.split(' '));
        println!("    {}", cache.join(","));
        cache.drain(..);
    }
}
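To make the "returns an iterator" part concrete, here is a small sketch with hypothetical helper names of my own. Note that drain, like clear, leaves the vector's element type, and therefore the lifetime of the stored references, unchanged; the example above iterates over string literals, which are 'static.

```rust
/// Hypothetical helper: empties `cache` with `drain(..)` and returns the
/// removed fields in upper case. The vector keeps its allocation.
fn drain_upper(cache: &mut Vec<&str>) -> Vec<String> {
    cache.drain(..).map(|field| field.to_uppercase()).collect()
}

/// Hypothetical helper: the same method on a String yields its chars.
fn drain_chars(text: &mut String) -> Vec<char> {
    text.drain(..).collect()
}
```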

Answer 5

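Another approach is to refrain from storing references altogether and to store indices instead. This trick can also be useful in other data-structure contexts, so this might be a nice opportunity to try it out.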

use std::io::BufRead;

fn main() {
    let stdin = std::io::stdin();
    let mut cache = Vec::new();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        cache.push(0);
        cache.extend(line.match_indices(' ').map(|x| x.0 + 1));
        // cache now contains the indices where new words start

        // do something with this information
        for i in 0..(cache.len() - 1) {
            print!("{},", &line[cache[i]..(cache[i + 1] - 1)]);
        }
        println!("{}", &line[*cache.last().unwrap()..]);
        cache.clear();
    }
}

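Although you already remarked on this yourself in the question, I feel it is worth pointing out that there are more elegant ways to do this with iterators, which may avoid allocating a vector altogether.

The approach above (itself borrowed from elsewhere) becomes more useful if you need to do something more complicated than printing.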
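As a sketch of what "more complicated than printing" might look like, the offsets can be wrapped in two small helpers that give random access to fields. `field_starts` and `nth_field` are hypothetical names of my own, and like the code above they assume a single ASCII space between fields.

```rust
/// Hypothetical helper: refills `starts` with the byte offset at which
/// each field of `line` begins, reusing the vector's allocation.
fn field_starts(line: &str, starts: &mut Vec<usize>) {
    starts.clear();
    starts.push(0);
    starts.extend(line.match_indices(' ').map(|(pos, _)| pos + 1));
}

/// Hypothetical helper: returns the `n`-th field of `line`, or `None`
/// if there are fewer than `n + 1` fields.
fn nth_field<'a>(line: &'a str, starts: &[usize], n: usize) -> Option<&'a str> {
    let begin = *starts.get(n)?;
    // A field ends one byte before the next field starts (at the
    // separating space), or at the end of the line for the last field.
    let end = starts.get(n + 1).map_or(line.len(), |next| next - 1);
    Some(&line[begin..end])
}
```

Because the vector holds plain usize values, it can live outside the loop without borrowing from any particular line.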

Answer 6

Elaborating on Francis's answer about using transmute(), I think this can be safely abstracted with this simple function:

pub fn zombie_vec<'a, 'b, T: ?Sized>(mut data: Vec<&'a T>) -> Vec<&'b T> {
    data.clear();
    unsafe {
        std::mem::transmute(data)
    }
}

Using this, the original code would be:

use std::io::BufRead;

fn main() {
    let stdin = std::io::stdin();
    let mut cache0 = Vec::<&str>::new();
    for line in stdin.lock().lines().map(|x| x.unwrap()) {
        let mut cache = cache0; // into the loop
        cache.extend(line.split(' '));
        println!("{}", cache.join(","));
        cache0 = zombie_vec(cache); // out of the loop
    }
}

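You need to move the outer vector into each loop iteration and restore it before the iteration finishes, while safely erasing the local lifetime.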


lifetime

rust

borrow-checker