Rust - open dynamic number of writers

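Suppose I have a dynamic number of input strings (barcodes) read from a file. I want to split a huge 111 GB text file based on matches against those input strings, and write the matches to files.

I don't know in advance how many inputs there will be.

I have all the file input and string matching working, but I'm stuck on the output step.

Ideally, I would open one file for each entry in the input vector barcodes, which contains only strings. Is there a way to open a dynamic number of output files?

A suboptimal approach would be to pass a single barcode string as an input argument and search for it, but that means I would have to read the huge file repeatedly.

The barcode input vector contains only strings, e.g. "TAGAGTAT", "TAGAGTAG", ...

Ideally, if the first two strings are given as input, the output should look like this: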

file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt

Thanks for your help.

extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;

fn read_barcodes () -> Vec<String> {
    
    // TODO - can replace this with file reading code (OR move to an arguments based model, parse and demultiplex only one oligomer at a time..... )

    // The `vec!` macro can be used to initialize a vector of strings
    let barcodes = vec![
        "TCTCAAAG".to_string(),
        "AACTCCGC".into(),
        "TAAACGCG".into()
    ];
    println!("Initial vector: {:?}", barcodes);
    barcodes
}

fn main() {
    //let filename = "test5m.fastq";

    let filename = "Undetermined_S0_R1.fastq";

    println!("Fastq filename: {} ", filename);
    //println!("Barcodes filename: {} ", barcodes_filename);

    let barcodes_vector: Vec<String> = read_barcodes();
    let mut counts_vector: [i32; 30] = [0; 30];

    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");

        // get sequence
        let sequence_bytes = seqrec.normalize(false);

        let sequence_text = str::from_utf8(&sequence_bytes).unwrap();
        //println!("Seq: {} ", &sequence_text);

        // get the first 8 characters (the barcode); bases are ASCII, so this is 8 bytes
        let sequence_oligo = &sequence_text[0..8];
        //println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequence_oligo);
        if sequence_oligo == barcodes_vector[0] {
            //println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequence_oligo);
            counts_vector[0] += 1;
        }
    }
}

You probably want a HashMap<String, File>. You can build it from the barcode vector like this:

use std::collections::HashMap;
use std::fs::File;
use std::path::Path;

fn build_file_map(barcodes: &[String]) -> HashMap<String, File> {
    let mut files = HashMap::new();

    for barcode in barcodes {
        let filename = Path::new(barcode).with_extension("txt");
        let file = File::create(filename).expect("failed to create output file");
        files.insert(barcode.clone(), file);
    }

    files
}

You can call it like this:

let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
let file_map = build_file_map(&barcodes);

This gives you a file that you can write to like this:

let barcode = &barcodes[0]; // borrow: a String cannot be moved out of a Vec by indexing
let file = file_map.get(barcode).expect("barcode not in file map");
// write to file
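
To fill in the "write to file" step, here is a minimal sketch of one way to do it. It assumes the map stores a BufWriter<File> per barcode instead of a bare File, so that many small writes are batched, and that lookups go through get_mut so the writer can be mutated. The names open_writers, write_line and flush_all, and the dir parameter, are hypothetical and not part of the answer above.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Opens one buffered writer per barcode, named `<barcode>.txt` inside `dir`.
/// (`dir` is an assumption; the answer above creates files in the current directory.)
fn open_writers(dir: &Path, barcodes: &[String]) -> io::Result<HashMap<String, BufWriter<File>>> {
    let mut writers = HashMap::new();
    for barcode in barcodes {
        let path = dir.join(format!("{}.txt", barcode));
        writers.insert(barcode.clone(), BufWriter::new(File::create(path)?));
    }
    Ok(writers)
}

/// Appends `line` plus a newline to the file for `barcode`.
/// Returns Ok(false) if no writer was opened for that barcode.
fn write_line(
    writers: &mut HashMap<String, BufWriter<File>>,
    barcode: &str,
    line: &str,
) -> io::Result<bool> {
    match writers.get_mut(barcode) {
        Some(writer) => {
            writeln!(writer, "{}", line)?;
            Ok(true)
        }
        None => Ok(false),
    }
}

/// Flushes every writer so that write errors surface here instead of
/// being silently ignored when the BufWriter is dropped.
fn flush_all(writers: &mut HashMap<String, BufWriter<File>>) -> io::Result<()> {
    for writer in writers.values_mut() {
        writer.flush()?;
    }
    Ok(())
}
```

In the read loop from the question, the call would look roughly like write_line(&mut writers, sequence_oligo, sequence_text)?, followed by one flush_all(&mut writers)? after the loop. Keep in mind that every open file uses a file descriptor, so a very large barcode list may run into the operating system's open-file limit.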

I just need an example of (a) how to properly instantiate a vector of files named after the relevant strings, (b) how to set up the output file objects properly, and (c) how to write to those files.

Here is a commented example:

use std::io::Write;
use std::fs::File;
use std::io;

fn read_barcodes() -> Vec<String> {
    // read barcodes here
    todo!()
}

fn process_barcode(barcode: &str) -> String {
    // process barcodes here
    todo!()
}

fn main() -> io::Result<()> {
    let barcodes = read_barcodes();
    
    for barcode in barcodes {
        // process barcode to get output
        let output = process_barcode(&barcode);
        
        // create file for barcode with {barcode}.txt name
        let mut file = File::create(format!("{}.txt", barcode))?;
        
        // write output to created file
        file.write_all(output.as_bytes())?;
    }
    
    Ok(())
}
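
The example above opens one file per barcode in turn. To avoid reading the 111 GB input more than once, all writers can be kept open and each record routed as it is read. Below is a rough sketch of that single-pass idea under simplifying assumptions: each input line is treated as one record (a real FASTQ record spans four lines, so with needletail the routing would sit inside the record loop instead), and the barcode is the first 8 bytes of the line, as in the question. The function name demultiplex and the constant BARCODE_LEN are hypothetical. The function is generic over Write, so it works with files as well as in-memory buffers.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

/// Length of the barcode prefix, taken from the 8-character slice in the question.
const BARCODE_LEN: usize = 8;

/// Reads `input` once and copies each line to the writer whose key equals the
/// line's first `BARCODE_LEN` bytes. Lines with an unknown or too-short prefix
/// are skipped. Returns the number of lines written per barcode.
fn demultiplex<R: BufRead, W: Write>(
    input: R,
    writers: &mut HashMap<String, W>,
) -> io::Result<HashMap<String, u64>> {
    // Start every known barcode at zero so the result lists all of them.
    let mut counts: HashMap<String, u64> =
        writers.keys().map(|key| (key.clone(), 0)).collect();

    for line in input.lines() {
        let line = line?;
        // `get` returns None for short lines, where `&line[0..8]` would panic.
        let prefix = match line.get(..BARCODE_LEN) {
            Some(prefix) => prefix,
            None => continue,
        };
        if let Some(writer) = writers.get_mut(prefix) {
            writeln!(writer, "{}", line)?;
            *counts
                .get_mut(prefix)
                .expect("a count exists for every writer key") += 1;
        }
    }

    // Flush once at the end so buffered output is not lost.
    for writer in writers.values_mut() {
        writer.flush()?;
    }
    Ok(counts)
}
```

For real output, the map values would be something like BufWriter<File> created with File::create(format!("{}.txt", barcode)), and the input a BufReader around the opened file. The returned counts replace the fixed-size counts_vector array from the question, so the number of barcodes no longer has to be known in advance.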