Grep summary statistics from log files

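I have thousands of files produced by aligning RNA-Seq data with STAR. Each file is a log ("*Log.final.out") that summarizes the statistics for one lane (4 lanes per sample in total). Since I have to combine all the statistics into a single file, I need to extract the following for each file and each lane: number of input reads, number of uniquely mapped reads, and percentage of uniquely mapped reads. Is there a way to extract everything I need from each file without copying and pasting it by hand?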

Here is an example of a log file:

                             Started job on |   Jul 17 18:34:39
                         Started mapping on |   Jul 17 18:34:39
                                Finished on |   Jul 17 18:35:44
   Mapping speed, Million of reads per hour |   507.64

                      Number of input reads |   9165655
                  Average input read length |   76
                                UNIQUE READS:
               Uniquely mapped reads number |   7953458
                    Uniquely mapped reads % |   86.77%
                      Average mapped length |   73.74
                   Number of splices: Total |   1924655
        Number of splices: Annotated (sjdb) |   1892117
                   Number of splices: GT/AG |   1909019
                   Number of splices: GC/AG |   6636
                   Number of splices: AT/AC |   1016
           Number of splices: Non-canonical |   7984
                  Mismatch rate per base, % |   0.43%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.30
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1179823
         % of reads mapped to multiple loci |   12.87%
    Number of reads mapped to too many loci |   9207
         % of reads mapped to too many loci |   0.10%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   0.22%
                 % of reads unmapped: other |   0.04%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

Try this:

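library(tidyverse)

# Directory that holds the STAR logs (placeholder, replace with your own path)
path <- "<PATH TO *.out FILES>"
# full.names = TRUE returns complete paths, so no paste0(path, ...) is needed
files <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)

# Read one log and keep only the three statistics of interest
merge_out <- function(file) {
  read.delim(file, header = FALSE) %>%
    # "Uniquely mapped reads" matches both the number line and the % line
    filter(grepl("Number of input reads", V1) |
           grepl("Uniquely mapped reads", V1)) %>%
    set_names("Var", "value")
}

# One data frame per log file, named after the file it came from
results <- lapply(files, merge_out)
names(results) <- basename(files)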
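
If you want everything in one table, with one row per log and numeric columns, here is a minimal base R sketch. The helper name `parse_star_log`, the demo directory, and the two sample file names are made up for illustration. The demo writes a shortened copy of the log from the question to a temporary folder, so it runs on its own. It assumes every statistics line has the form "key | value" and that all three keys are present in each log.

```r
# Hypothetical helper: parse one STAR Log.final.out into a one-row data frame.
parse_star_log <- function(file) {
  lines <- readLines(file)
  lines <- lines[grepl("|", lines, fixed = TRUE)]   # keep only "key | value" lines
  key   <- trimws(sub("\\|.*$", "", lines))         # text before the first "|"
  value <- trimws(sub("^[^|]*\\|", "", lines))      # text after the first "|"
  stats <- setNames(value, key)
  data.frame(
    file                = basename(file),
    input_reads         = as.numeric(stats[["Number of input reads"]]),
    uniquely_mapped     = as.numeric(stats[["Uniquely mapped reads number"]]),
    uniquely_mapped_pct = as.numeric(sub("%", "", stats[["Uniquely mapped reads %"]],
                                         fixed = TRUE))
  )
}

# Demo input: a shortened copy of the example log, saved under two made-up lane names.
demo_dir <- file.path(tempdir(), "star_logs_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_log <- c(
  "                          Number of input reads |\t9165655",
  "                                    UNIQUE READS:",
  "                   Uniquely mapped reads number |\t7953458",
  "                        Uniquely mapped reads % |\t86.77%"
)
writeLines(demo_log, file.path(demo_dir, "sampleA_L001_Log.final.out"))
writeLines(demo_log, file.path(demo_dir, "sampleA_L002_Log.final.out"))

# Parse every log in the folder and stack the rows into one table.
log_files <- list.files(demo_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
summary_table <- do.call(rbind, lapply(log_files, parse_star_log))

# Write the combined statistics to a single CSV file.
write.csv(summary_table, file.path(demo_dir, "star_summary.csv"), row.names = FALSE)
print(summary_table)
```

For your real data, point `list.files()` at your own directory instead of `demo_dir`. If the lane is encoded in the file name, you can pull it out of the `file` column afterwards.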

Let me know if this helps.