Remove multiple sequences from a FASTA file

Question

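I have a text file of character sequences stored as two-line records: a header, and the sequence itself on the next line. The file is structured as follows: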

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file, I have a list of the headers of the sequences I want to remove, as follows:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each of these headers plus the line below it. I did it with sed as follows:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works, but it takes a long time, because sed loads the whole file once per header, and the file is large. Any ideas on how to speed this up?

Answer 1

One option is to build a single long sed expression:

sedcmd=
while read line; do sedcmd+="/^$line$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt

This reads the file only once. Note that I added ^ and $ to the sed pattern (so that >header1 does not match >header123...).

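If you have thousands of headers to delete, using a script file (as @daniu suggests) is probably better, because this approach could hit the maximum command-line length.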
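To illustrate why the anchors matter, here is a minimal sketch that builds both an anchored and an unanchored sed program and runs each against a small sample. The file contents and the variable names (`anchored`, `loose`) are made up for the demonstration, and the `addr,+1` address form assumes GNU sed.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' '>header12' 'bbbb' '>header2' 'cccc' > "$tmp/first_file.txt"
printf '%s\n' '>header1' > "$tmp/second_file.txt"

# Build one sed program per style; each header adds one delete command.
anchored=
loose=
while IFS= read -r line; do
    anchored="$anchored/^$line\$/,+1d;"   # whole-line match only
    loose="$loose/$line/,+1d;"            # matches anywhere in the line
done < "$tmp/second_file.txt"

# The address form "/re/,+1" (this line and the next) is a GNU sed extension.
anchored_out=$(sed "$anchored" "$tmp/first_file.txt")
loose_out=$(sed "$loose" "$tmp/first_file.txt")

printf 'anchored:\n%s\n\nloose:\n%s\n' "$anchored_out" "$loose_out"
rm -rf "$tmp"
```

With this sample, the unanchored program also removes the >header12 record, while the anchored one keeps it. Headers that contain `/` or regular-expression metacharacters would still need escaping in either version.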

Answer 2

Create a script of delete commands from the second file:

sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file:

sed -f commands.sed firstFile.txt
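For reference, here is a small sketch of what the generated script looks like and what it does to a sample file. It is a variation rather than the exact command above: it wraps each header in ^ and $ so that only whole-line matches are deleted. The sample data is hypothetical, and the `,+1` address again assumes GNU sed.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' '>header2' 'bbbb' '>header3' 'cccc' > "$tmp/firstFile.txt"
printf '%s\n' '>header1' '>header3' > "$tmp/secondFile.txt"

# Turn every header line into an anchored "delete this line and the next" command.
# In the replacement, & stands for the whole matched line.
sed 's#.*#/^&$/,+1d#' "$tmp/secondFile.txt" > "$tmp/commands.sed"

commands=$(cat "$tmp/commands.sed")
result=$(sed -f "$tmp/commands.sed" "$tmp/firstFile.txt")

printf 'generated script:\n%s\n\nfiltered file:\n%s\n' "$commands" "$result"
rm -rf "$tmp"
```

Because the commands live in a file, this avoids any command-line length limit. Each input line is still tested against every generated pattern, so it may be slower than an awk array lookup on very long lists.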

Answer 3

This awk may work for you:

awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
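The one-liner is terse, so here is a commented sketch of the same idea on hypothetical sample data. It uses `$0 in drop` instead of `a[$0]` for the lookup, a small change that avoids creating an array entry for every line of the data file; the array name `drop` is made up.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' '>header2' 'bbbb' '>header3' 'cccc' > "$tmp/input1"
printf '%s\n' '>header1' '>header3' > "$tmp/input2"

result=$(awk '
    # First file (the header list): remember every line as an array key.
    FNR == NR { drop[$0] = 1; next }
    # Second file: on a listed header, read and discard the next line too.
    $0 in drop { getline; next }
    # Everything else is printed unchanged.
    { print }
' "$tmp/input2" "$tmp/input1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the original, this assumes exactly one sequence line per header and a non-empty header list (with an empty list, FNR==NR would also hold for the data file).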

Answer 4

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

c is the number of lines you want to skip, starting from the line that just matched.

Or:

$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

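f records whether the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be shortened to f=($0 in a), but I prefer the ternary expression for clarity.

The first script relies on you knowing how many lines each record has, while the second relies on every record starting with >. If you know both, which one you use is a matter of style.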
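The difference between the two scripts shows up as soon as a record spans more than two lines. A small sketch, using a hypothetical file in which >header1 has two sequence lines:

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' > "$tmp/list"
printf '%s\n' '>header1' 'aaaa' 'aaaa' '>header2' 'bbbb' > "$tmp/file"

# Counter-based: skips exactly 2 lines (the header and one sequence line).
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' "$tmp/list" "$tmp/file")

# Flag-based: skips until the next line that starts with ">".
by_flag=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a)} !f' "$tmp/list" "$tmp/file")

printf 'counter-based:\n%s\n\nflag-based:\n%s\n' "$by_count" "$by_flag"
rm -rf "$tmp"
```

Here the counter-based script leaves the second aaaa line behind as an orphan, while the flag-based script removes the whole record.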

Answer 5

You can use this awk:

awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

Answer 6

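Your question is easy to answer, but the answer will not help you when handling generic FASTA files. A FASTA file has a sequence header followed by one or more lines that can be concatenated to form the sequence. The FASTA file format roughly follows these rules:

The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.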

Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).

The sequence can span multiple lines.

A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.

Most of the proposed methods will fail on multi-FASTA files with multi-line sequences.

The following also handles multi-line sequences:

awk '(NR==FNR) { toRemove[$0]; next }
     /^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
     p' headers.txt file.fasta

This is very similar to the answers by EdMorton and Anubhava, but the difference here is that the file headers.txt may contain only part of each header.
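One caveat with partial matching through ~ is that each entry of headers.txt is interpreted as a regular expression. Below is a sketch of a variation that assumes the entries are literal substrings and uses index() instead. The sample identifiers (seq1, seq2) and file contents are hypothetical.

```shell
# Hypothetical sample data: a multi-line record, a blank separator line,
# and headers that carry extra description text after the identifier.
tmp=$(mktemp -d)
printf '%s\n' 'seq1' > "$tmp/headers.txt"
printf '%s\n' '>seq1 sample A' 'aaaa' 'aaaa' '' '>seq2 sample B' 'bbbb' > "$tmp/file.fasta"

result=$(awk '
    # First file: collect the (partial) identifiers to remove.
    NR == FNR { toRemove[$0]; next }
    # On every header line, decide whether the record that follows is printed.
    /^>/ {
        p = 1
        # index() does a literal substring search, so ".", "|" or "[" in an
        # identifier are not interpreted as regular-expression operators.
        for (h in toRemove) if (index($0, h)) { p = 0; break }
    }
    # Print header and sequence lines only while p is set.
    p
' "$tmp/headers.txt" "$tmp/file.fasta")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that a substring such as seq1 would also match a header containing seq10, so partial identifiers need to be specific enough. The loop over toRemove runs once per header line, which is slower than an exact `in` lookup when the list is long.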

Answer 7

Try GNU sed:

sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f -  first_file.txt

Prefix both scripts with the time command to compare their speed: time while read line; do ... and time sed -.... In my test, this finished in less than half the time of the OP's version.
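To make the pipeline easier to follow, this sketch runs its two stages separately on hypothetical sample data and captures the intermediate sed script. It assumes GNU sed, as the answer does.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' '>header2' 'bbbb' '>header3' 'cccc' > "$tmp/first_file.txt"
printf '%s\n' '>header1' '>header3' > "$tmp/second_file.txt"

# Stage 1: join all headers with "|" and wrap them as  /h1|h2/{N;d}
script=$(sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' "$tmp/second_file.txt")

# Stage 2: run the generated one-line script (read from stdin via -f -).
result=$(printf '%s\n' "$script" | sed -E -f - "$tmp/first_file.txt")

printf 'generated script: %s\n\nfiltered file:\n%s\n' "$script" "$result"
rm -rf "$tmp"
```

The generated alternation is not anchored, so >header1 would also match >header12, and the whole header list ends up in a single regular expression.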
