如何加速在非常大的单单元 BAM 文件上使用 Regex 的 sed

Question

我有以下尝试计算的简单脚本 SAM/BAM file:

中用“CB:Z”编码的标签

samtools view -h small.bam |  grep "CB:Z:" |
    sed 's/.*CB:Z:\([ACGT]*\).*//' |
    sort |
    uniq -c |
    awk '{print  " " }'

通常它需要处理 4000 万行。该代码大约需要 1 小时才能完成。

这一行sed 's/.*CB:Z:$[ACGT]*$.*//'非常耗时。我怎样才能加快速度？

我使用正则表达式的原因是“CB”标签的列位置不是固定的。有时在第 20 列，有时在第 21 列。

可以找到示例 BAM 文件 HERE.

更新

完整 4000 万行文件的速度比较：

我的初始代码：

real    21m47.088s
user    26m51.148s
sys 1m27.912s

James Brown 的 AWK：

real    1m28.898s
user    2m41.336s
sys 0m6.864s

詹姆斯·布朗与 MAWK：

real    1m10.642s
user    1m41.196s
sys 0m6.484s

Answer 1

另一个 awk，很像@tripleee 的，我想：

$ samtools view -h small.bam | awk '
match([=10=],/CB:Z:[ACGT]*/) {               # use match for the regex match
    a[substr([=10=],RSTART+5,RLENGTH-5)]++   # len(CB:z:)==5, hence +-5
}
END {
    for(i in a)
        print i,a[i]                     # sample output,tweak it to your liking
}'

示例输出：

...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...

注意：您的sed 's/.*CB:Z:...匹配最后一个实例，而我的awk 'match([=13=],/CB:Z:[ACGT]*/)...匹配第一个实例。

通知 2：在评论中引用@Sundeep：- - 使用 LC_ALL=C mawk '..' 将提供更快的速度。

Answer 2

与perl

perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'

CB:Z:\K[ACGT]++ 将匹配前面有 CB:Z: 的任何 ACGT 个字符序列。 \K 用于防止 CB:Z: 成为匹配部分的一部分，可通过 $& 变量

获得

small.bam 输入文件的采样时间。 mawk 对于这个输入是最快的，但对于更大的输入文件它可能会改变。

# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real    0m0.092s

# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real    0m0.054s

$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real    0m0.064s

$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical

Answer 3

最好避免首先解析 samtools view 的输出。这是使用 python and the pysam 库获得所需内容的一种方法：

import pysam

from collections import defaultdict

counts = defaultdict(int)
tag = 'CB'

with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):
            counts[ aln.get_tag(tag) ] += 1

for k, v in counts.items():
    print(k, v)

Answer 4

按照您原来的管道方法：

pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
 awk '{++c[[=10=]]} END {for (i in c) print i,c[i]}'

如果您有兴趣尝试加速 sed（尽管它可能不是最快的）：

sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
 awk '{++c[[=11=]]} END {for (i in c) print i,c[i]}'

_{以上语法与 GNU sed 兼容。}

Answer 5

关于 :

$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical

绝对没有必要将它们物理输出到磁盘并进行比较。只需将每个管道的输出传输到一个非常高速的哈希算法，该算法几乎不会增加时间（当输出足够大时，您最终可能会节省时间而不是转到磁盘。

我个人最喜欢的是 128 位模式的 xxhash，可通过 python pip 获得。它不是加密散列，但它甚至比 MD5 之类的东西快得多。此方法还允许进行无忧比较，因为它的基准时间也将执行准确性检查。

Answer 6

重新分级基于 AWK 的解决方案，我注意到很少有人利用 FS。

我对 BAM 格式不太熟悉。如果CB每行只出现一次，那么

mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:"; 

   }  ~ /^[ACGT]/ {     # if FS never matches,  would be beyond
                          # end of line, then this would just match
                          # against  null string, & eval to false

      seen[substr(, 1, -1 + match(, /[^ACGT]|$/))]++ 

   } END { for (x in seen) { print seen[x] " " x } }'

如果出现不止一次，则将其更改为任何大于 1 的字段的循环。此版本使用最懒惰的评估模型来加快速度，然后执行所有 uniq -c 项目。

虽然这与上面的最佳答案非常相似，但通过让 FS 预先拆分字段，它会导致 match() 和 substr() 做更少的工作。我只是在遗传序列之后匹配 1 个字符，并直接使用其 return，负 1 作为子串长度，并一起跳过 RSTART 或 RLENGTH。

如何加速在非常大的单单元 BAM 文件上使用 Regex 的 sed

How to speed-up sed that uses Regex on very large single cell BAM file

unix

linux

awk

sed

bioinformatics