Split and count AWK

Question

I have a file like this (VCF):

##fileformat=VCFv4.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO    FORMAT      NA00001
Chr02   259 .   A   .   20  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:1:0,1:0,0:0,0:0,0:0,26,23,75,33,33,33,47,52,49:23
Chr02   260 .   C   .   13  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:1:0,0:0,1:0,0:0,0:24,0,70,17,25,49,43,25,25,44:16
Chr02   261 .   C   .   13  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:1:0,0:0,1:0,0:0,0:24,0,194,18,25,49,44,25,25,45:16
Chr02   262 .   C   A   21  .   .   GT:DP:A:C:G:T:PP:GQ 0/1:1:0,0:0,1:0,0:0,0:387,0,342,348,25,368,376,25,25,368:25
Chr02   263 .   C   .   24  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:2:0,0:1,1:0,0:0,0:541,0,529,495,29,556,508,29,29,499:29
Chr02   264 .   A   .   31  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:2:1,1:0,0:0,0:0,0:0,280,192,317,36,36,36,178,302,219:36
Chr02   265 .   G   C   25  .   .   GT:DP:A:C:G:T:PP:GQ 0/1:2:0,0:0,0:1,1:0,0:255,414,0,328,284,29,284,29,351,29:29
Chr02   266 .   A   .   31  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:2:1,1:0,0:0,0:0,0:0,281,323,440,36,36,36,209,309,315:36
Chr02   267 .   C   .   24  .   .   GT:DP:A:C:G:T:PP:GQ 0/0:2:0,0:1,1:0,0:0,0:595,0,541,481,28,567,512,28,28,512:

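I need to split this file into chunks of 1000 lines and count the "0/1" occurrences in each chunk (split file). To do that, I split the file with this command: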

head -10 Chromosome02.vcf| tee header subset.1.VCF >/dev/null ; awk -v header="`cat header`" -v count=1 '( (NR>10) && !( (NR-1) % 2) ) { count++ ; print header >"subset." count ".VCF";} {print $0 >>"subset." count ".VCF";}' Chromosome02.vcf | grep "0/1"

But it doesn't work. Is there a way to do this without generating the split files?

Expected output

Chromosome02
00
00
01
00
01
00

Answer 1

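Based on the samples you have shown, could you please try the following. It assumes we need to count the 0/1 occurrences in every 1000 data lines and print each count.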

Using tail + awk:

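tail -n +11 Input_file |
awk '
{
  total+=gsub(/0\/1/,"&")        # add the number of 0/1 matches on this line
}
FNR%1000==0{                     # a block of 1000 lines is complete
  if(++count==1){ print "Chromosome02" }
  print total+0
  total=0
}
END{
  if(++count==1){ print "Chromosome02" }
  if(NR%1000){ print total+0 }   # print the last, partial block
}'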

Using awk only:

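awk '
FNR==11{
  start=1                        # data lines begin after the 10 header lines
}
!start{ next }
{
  total+=gsub(/0\/1/,"&")        # add the number of 0/1 matches on this line
  line++
}
line%1000==0{                    # a block of 1000 data lines is complete
  if(++count==1){ print "Chromosome02" }
  print total+0
  total=0
}
END{
  if(++count==1){ print "Chromosome02" }
  if(line%1000){ print total+0 } # print the last, partial block
}' Input_file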

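Note: This assumes the OP wants to count 0/1 anywhere on each line of Input_file. If you want to check only a specific field, change the substitution above so that it targets that field.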
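A field-specific variant could look something like the sketch below. It is only an illustration: the function name count_het_per_block is made up, and it assumes a single-sample VCF where column 10 holds the sample and GT is the first colon-separated subfield, as in the FORMAT column shown in the question. It counts lines whose genotype is exactly 0/1 rather than every 0/1 substring.

```shell
# Hypothetical helper: reads header-free VCF data lines on stdin and prints,
# for each block of n lines, how many lines have a 0/1 genotype.
# Assumes one sample in column 10 with GT as its first colon-separated subfield.
count_het_per_block() {
  awk -v n="$1" '
    {
      split($10, sample, ":")            # sample[1] is the GT value, e.g. 0/1
      if (sample[1] == "0/1") c++
    }
    NR % n == 0 { print c + 0; c = 0 }   # block complete: print and reset
    END { if (NR % n) print c + 0 }      # print the final, partial block
  '
}
```

It would be called as: tail -n +11 Input_file | count_het_per_block 1000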

Answer 2

tail -n +11 file |
    awk -v n=1000 '/0\/1/{c++} NR%n==0{print c+0; c=0} END {if (NR%n!=0) print c+0}'

The tail command strips the headers from the file. We then count the lines that contain the pattern, and every N lines we print the counter and reset it.
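If the number of header lines is not known in advance, the same idea can be sketched without tail by skipping every line that starts with "#" (VCF meta lines and the column header line all do). The function name count_blocks_skip_header below is hypothetical, and the sketch keeps its own counter of data lines instead of relying on NR.

```shell
# Hypothetical helper: count lines containing 0/1 per block of n data lines,
# skipping every line that starts with "#" instead of a fixed header length.
# Usage: count_blocks_skip_header 1000 Chromosome02.vcf
count_blocks_skip_header() {
  awk -v n="$1" '
    /^#/ { next }                          # VCF meta and column-header lines
    { lines++ }                            # data lines seen so far
    /0\/1/ { c++ }                         # this data line contains the pattern
    lines % n == 0 { print c + 0; c = 0 }  # block complete: print and reset
    END { if (lines % n) print c + 0 }     # print the final, partial block
  ' "$2"
}
```

As in the one-liner above, this counts matching lines rather than individual occurrences, which gives the same result when each line carries a single sample.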


awk

vcf-variant-call-format