Grep -A1 -f returns more results than it should

Question

Here is my problem:

I have a fasta file containing genetic data like this (my.fasta):

>TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCT
>TR2|c0_g1_i1
AAGCAGTGCAGAAGAACTGGCGAA...

I also have a list of names, a subset of those in my.fasta, for which I want to extract the sequences (names.list):

TR3|c0_g1_i1
TR4|c0_g1_i1

What I want to get is this:

>TR3|c0_g1_i1
CGGATCATGGTCTTGGTCAAAA
>TR4|c0_g1_i1
ATTGGGGGTTTTAAACTGGCGAA...

I am running: grep -A1 -f names.list my.fasta | grep -v "^--$" > new.fasta

But! My names.list contains 30566 names, yet when I run grep -c ">" new.fasta I get 31080.

Name list for the whole my.fasta: http://speedy.sh/PQpdD/names.myfasta.list
Name list of the subset I want: http://speedy.sh/kzqKr/names.list

Thanks!

Answer 1

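Some of your names are contained in others, for example TR74928|c6_g4_i1 and TR74928|c6_g4_i10. The pattern for the first also matches the header line of the second, so grep returns more than one header line for such a name.

To fix this: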
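The overlap can be reproduced on a tiny made-up data set. This is only a sketch: the two sequences and the temporary directory are invented for illustration, and only the two identifiers come from the example above.

```shell
# Hypothetical miniature files, created in a throwaway directory
tmp=$(mktemp -d)
printf '%s\n' '>TR74928|c6_g4_i1' 'ACGT' \
              '>TR74928|c6_g4_i10' 'TTGA' > "$tmp/my.fasta"
printf '%s\n' 'TR74928|c6_g4_i1' > "$tmp/names.list"

# Only one name is requested, but the unanchored pattern is also
# a substring of the ..._i10 header, so both headers are selected.
overmatch=$(grep -A1 -f "$tmp/names.list" "$tmp/my.fasta" \
            | grep -v '^--$' | grep -c '>')
echo "headers returned: $overmatch"
```

With one name in the list, two headers come back, which is the same effect as 30566 names producing 31080 headers.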

sed -e 's/^/>/g' names.list > copy.list

to get the names prefixed with >, as they appear in your file my.fasta, and then:

grep -A1 -x -f copy.list my.fasta | grep -v "^--$" > new.fasta

to match only the lines that consist exactly of one of your identifiers.

-x, --line-regexp Select only those matches that exactly match the whole line. This option has the same effect as anchoring the expression with ^ and $.
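As a quick check of the -x approach, here is a sketch on made-up data (the sequences and temporary paths are hypothetical). It also tries adding -F, which is my own addition and not part of the commands above: it makes grep treat each name as a fixed string instead of a regular expression, which may be safer if names ever contain characters such as a dot.

```shell
# Hypothetical miniature files, created in a throwaway directory
tmp=$(mktemp -d)
printf '%s\n' '>TR74928|c6_g4_i1' 'ACGT' \
              '>TR74928|c6_g4_i10' 'TTGA' > "$tmp/my.fasta"
printf '%s\n' 'TR74928|c6_g4_i1' > "$tmp/names.list"

# Prefix every name with ">" so it equals a complete header line
sed -e 's/^/>/' "$tmp/names.list" > "$tmp/copy.list"

# -x: a pattern must match the whole line, so ..._i10 is no longer selected
exact=$(grep -A1 -x -f "$tmp/copy.list" "$tmp/my.fasta" \
        | grep -v '^--$' | grep -c '>')

# Assumed variant: -F additionally treats the names as literal strings
exact_fixed=$(grep -A1 -F -x -f "$tmp/copy.list" "$tmp/my.fasta" \
              | grep -v '^--$' | grep -c '>')
echo "with -x: $exact, with -F -x: $exact_fixed"
```

Both variants return a single header for the single requested name. Note that -x requires the header line to contain nothing but the identifier.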

A simpler solution would be:

grep -A1 -w -f names.list my.fasta | grep -v "^--$" > new.fasta

But this only works if the identifier lines in my.fasta contain no more than one "word" (identifier).

-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
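To see where -w helps and where it can still over-match, here is a sketch on made-up data. The third header, >TR74928|c6_g4_i1.2, is purely hypothetical and is there only to show that any non-word character (here a dot) counts as a word boundary.

```shell
# Hypothetical miniature files, created in a throwaway directory
tmp=$(mktemp -d)
printf '%s\n' '>TR74928|c6_g4_i1' 'ACGT' \
              '>TR74928|c6_g4_i10' 'TTGA' \
              '>TR74928|c6_g4_i1.2' 'GGCC' > "$tmp/my.fasta"
printf '%s\n' 'TR74928|c6_g4_i1' > "$tmp/names.list"

# -w rejects ..._i10 (the match is followed by the word character "0"),
# but accepts ..._i1.2 (the match is followed by the non-word character ".")
word=$(grep -A1 -w -f "$tmp/names.list" "$tmp/my.fasta" \
       | grep -v '^--$' | grep -c '>')
echo "headers returned with -w: $word"
```

So -w is fine for identifiers like the ones in the question, while the -x variant is the stricter choice if headers could carry suffixes separated by punctuation.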

bash

command-line

grep

fasta