How to read compressed TSV files (*.gtf.gz) with rust-polars?

Question

Rust beginner coming from Python here. I want to read compressed GTF (*.gtf.gz) files with rust-polars:

    let schema = Arc::new(Schema::new(vec![
        Field::new("contigName", DataType::Categorical),
        Field::new("source", DataType::Utf8),
        Field::new("feature", DataType::Categorical),
        Field::new("start", DataType::Int64),
        Field::new("end", DataType::Int64),
        Field::new("score", DataType::Float32),
        Field::new("strand", DataType::Categorical),
        Field::new("frame", DataType::Categorical),
        Field::new("attribute", DataType::Utf8),
    ]));

    let mut df = CsvReader::from_path(r).unwrap()
        .with_delimiter(b'\t')
        .with_schema(&schema)
        .with_comment_char(Some(b'#'))
        .with_n_threads(Some(1)) // comment out this line to enable multithreading
        .with_encoding(CsvEncoding::LossyUtf8)
        .has_header(false)
        .finish()?;

    let test = df.head(Some(10));
    println!("{}", test);

However, I ran into several problems:

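How to tell Polars that the file is compressed?
I tried passing io::BufReader::new(GzDecoder::new(f)) instead of the file, but that failed.
How to parse Categorical columns?
How to handle possibly missing or additional columns?
How to read a file which has '#' as header and '##' as comment?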

Answer 1

Hi, there are several questions here. I'll try to answer the ones I can.

How to tell Polars that the file is compressed?

You don't have to. You only need to compile polars with the decompress or decompress-fast feature flag. (The first is Rust native; the latter requires a C compiler.)
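For illustration, enabling the flag in Cargo.toml might look like the fragment below. The feature names are the ones mentioned above; the version is only a placeholder assumption, so pin whichever release you actually build against.

```toml
[dependencies]
# "*" is a placeholder, not a recommendation; use the release you depend on.
polars = { version = "*", features = ["decompress"] }

# Alternative if a C compiler is available:
# polars = { version = "*", features = ["decompress-fast"] }
```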

How to parse Categorical columns

You either set DataType::Categorical in the schema, or you parse the column as Utf8 first and cast it afterwards.

    df.may_apply("some_utf8_column", |s| s.cast(&DataType::Categorical))?;

How to handle possibly missing or additional columns?

I'm not sure what you mean by "handle" here?

How to read a file which has '#' as header and '##' as comment?

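Polars currently allows only a single comment character. You can set this comment character, and every line that starts with it will be ignored. With '#' as the comment character, that means a '#' header line is skipped along with the '##' comment lines.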
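One possible workaround, sketched here with the standard library only, is to pre-process the (already decompressed) text before handing it to the reader: drop the '##' lines and pull the column names out of the single '#' line. The function name split_header_and_body is hypothetical, and the "## = comment, # = header" convention is taken from the question, not from Polars.

```rust
/// Splits GTF-like text into an optional header and the data body.
///
/// Assumed convention (from the question): lines starting with "##" are
/// comments, and a line starting with a single "#" holds the column names.
fn split_header_and_body(text: &str) -> (Option<Vec<String>>, String) {
    let mut header: Option<Vec<String>> = None;
    let mut body = String::new();

    for line in text.lines() {
        if line.starts_with("##") {
            // Double-hash lines are plain comments: drop them.
            continue;
        }
        if let Some(rest) = line.strip_prefix('#') {
            // Keep only the first single-hash line as the header;
            // any later single-hash lines are discarded.
            if header.is_none() {
                header = Some(rest.trim().split('\t').map(str::to_string).collect());
            }
            continue;
        }
        // Everything else is data and is passed through unchanged.
        body.push_str(line);
        body.push('\n');
    }

    (header, body)
}
```

The returned body contains only data rows, so it can be parsed with has_header(false) and without a comment character, and the extracted names can be used to build the schema or rename the columns afterwards. This reads the whole file into memory, which may or may not be acceptable for large GTF files.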

rust

rust-polars