Boosting the grep search using GNU parallel
I am using the following grep script to output all unmatched patterns (the first grep extracts every pattern that occurs in large_strings.txt; the second then removes those from patterns.txt):
grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
The patterns file contains 12-character-long substrings (a few instances shown below):
6b6c665d4f44
8b715a5d5f5f
26364d605243
717c8a919aa2
The large_strings file contains very long strings, around 20 to 100 million characters each (a small piece of one string shown below):
121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525
How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but that does not let you pipe the two grep commands together.
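For what it is worth, one workaround is to parallelize only the first grep and keep the second one as a single pass over the combined output. A minimal sketch (my assumption, not tested on this data; note that every parallel job loads patterns.txt, so it must fit in memory per job):

parallel --pipepart --block -1 -a large_strings.txt grep -oFf patterns.txt |
  grep -vFf - patterns.txt > unmatched_patterns.txt

--pipepart splits only at newlines, so no 12-mer is lost at a chunk boundary; however, with lines this long the chunks will be very uneven, which is what the answer below works around.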
By the way, the strings and patterns are all hexadecimal.
The working code below is a bit faster than plain grep:
rg -oFf patterns.txt large_strings.txt | rg -vFf - patterns.txt > unmatched_patterns.txt
grep took an hour to complete the pattern-matching process, while ripgrep took around 45 minutes.
If you do not need to use grep, try:
build_k_mers() {
    k="$1"     # k-mer length (12 here)
    slot="$2"  # job slot number, so each parallel job writes its own files
    # Slide a window of length $k over each input line and append every
    # k-mer to a bucket file named after its first two characters
    perl -ne 'for $n (0..(length $_)-'"$k"') {
        $prefix = substr($_,$n,2);
        $fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
        $fh = $fh{$prefix};
        print $fh substr($_,$n,'"$k"'),"\n"
    }'
}
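A hypothetical standalone invocation (my illustration, not part of the script) to see what the function emits:

mkdir -p tmp
printf '6b6c665d4f446b6c665d4f44\n' | build_k_mers 12 1
# each 12-char window is appended to tmp/kmer.<first 2 chars>.1;
# e.g. tmp/kmer.6b.1 now contains 6b6c665d4f44 twice (offsets 0 and 12)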
export -f build_k_mers
rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &
# Make shorter lines: after every ~32k characters insert a newline and
# repeat the preceding 12 characters, so no 12-mer is lost at the break.
# This makes it easier for --pipepart to find a newline;
# it does not change the set of k-mers generated.
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
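The same substitution with toy parameters (5-character blocks and a 2-character overlap instead of 32000 and 12; my illustration):

perl -pe 's/(.{5})(.{2})/$1$2\n$2/g' <<< abcdefghijklmn
# abcdefg
# fghijklmn
# mn

Every 2-character window of the input still occurs on some line, because the 2 characters before each inserted newline are repeated at the start of the next line.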
# Build 12-mers
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
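The backtick expression merely enumerates the 256 possible 2-hex-digit prefixes (00 01 02 ... ff), so each job merges and deduplicates the bucket files of exactly one prefix.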
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??
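To collect the result into a single file as in the original script, the output of this last step can simply be redirected (my addition):

parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.?? > unmatched_patterns.txt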
I have tested this on a patterns.txt of 9 GBytes / 725,937,231 lines and a large_strings.txt of 19 GBytes / 184 lines; on my 64-core machine it completes in 3 hours.