Grep summary statistics from Log files
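I have thousands of files produced by aligning RNA-Seq data with STAR. Each file is a log ("*Log.final.out") that summarizes the statistics for one lane (4 lanes per sample in total). Since I have to combine all the statistics into a single file, I need to extract the following from each file, for each lane: the number of input reads, the number of uniquely mapped reads, and the percentage of uniquely mapped reads. Is there a way to extract all of this from every file without copying and pasting by hand?
Here is an example of the log file: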
Started job on | Jul 17 18:34:39
Started mapping on | Jul 17 18:34:39
Finished on | Jul 17 18:35:44
Mapping speed, Million of reads per hour | 507.64
Number of input reads | 9165655
Average input read length | 76
UNIQUE READS:
Uniquely mapped reads number | 7953458
Uniquely mapped reads % | 86.77%
Average mapped length | 73.74
Number of splices: Total | 1924655
Number of splices: Annotated (sjdb) | 1892117
Number of splices: GT/AG | 1909019
Number of splices: GC/AG | 6636
Number of splices: AT/AC | 1016
Number of splices: Non-canonical | 7984
Mismatch rate per base, % | 0.43%
Deletion rate per base | 0.01%
Deletion average length | 1.40
Insertion rate per base | 0.01%
Insertion average length | 1.30
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1179823
% of reads mapped to multiple loci | 12.87%
Number of reads mapped to too many loci | 9207
% of reads mapped to too many loci | 0.10%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.22%
% of reads unmapped: other | 0.04%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Try this:
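library(tidyverse)
# Placeholder: set this to the directory that holds the STAR logs
path <- "<PATH TO *.out FILES>"
# full.names = TRUE returns complete paths, so no paste0() is needed later
files <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)
# Read one log and keep the three rows of interest.
# Assumes label and value are separated by " |" plus a tab, so read.delim() yields two columns.
merge_out <- function(file) {
  read.delim(file, header = FALSE, col.names = c("Var", "value"), strip.white = TRUE) %>%
    # "Uniquely mapped reads" matches both the "number" and the "%" row
    filter(grepl("Number of input reads|Uniquely mapped reads", Var)) %>%
    # Drop the trailing " |" from the label and record which file the rows came from
    mutate(Var = sub("\\s*\\|$", "", Var), file = basename(file))
}
# One long data frame: three rows per log file
results <- bind_rows(lapply(files, merge_out))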
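If you would rather end up with one row per log and write everything to a single file, something along these lines may work. It is only a base R sketch: `parse_star_log`, the column names, and the demo file name `sample1_L001_Log.final.out` are made up for illustration, and it assumes every statistic sits on a line of the form "label | value", as in the example log above.

```r
# Hypothetical helper: parse one STAR Log.final.out into a one-row data frame.
# Assumes each statistic is written as "<label> |<tab><value>".
parse_star_log <- function(file) {
  lines <- readLines(file)
  # Return the text after the "|" on the first line containing `label`
  get_value <- function(label) {
    hit <- grep(label, lines, fixed = TRUE, value = TRUE)[1]
    trimws(sub("^.*\\|", "", hit))
  }
  data.frame(
    file = basename(file),
    input_reads = as.numeric(get_value("Number of input reads")),
    uniquely_mapped = as.numeric(get_value("Uniquely mapped reads number")),
    # Strip the "%" sign so the percentage is stored as a number
    uniquely_mapped_pct = as.numeric(
      sub("%", "", get_value("Uniquely mapped reads %"), fixed = TRUE)
    )
  )
}

# Demo: write a few lines of the example log to a temporary directory
demo_dir <- tempfile("star_logs")
dir.create(demo_dir)
writeLines(
  c("Number of input reads |\t9165655",
    "UNIQUE READS:",
    "Uniquely mapped reads number |\t7953458",
    "Uniquely mapped reads % |\t86.77%"),
  file.path(demo_dir, "sample1_L001_Log.final.out")
)

# Parse every log in the directory and stack the rows
logs <- list.files(demo_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
summary_table <- do.call(rbind, lapply(logs, parse_star_log))

# Write all statistics to a single CSV file
write.csv(summary_table, file.path(demo_dir, "star_summary.csv"), row.names = FALSE)
print(summary_table)
```

For real data, point `list.files()` at your own log directory instead of `demo_dir`. If the lane is encoded in the file name, it can be pulled out of the `file` column afterwards.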
Let me know if that helps.