使用特定符号之间匹配的 ID 字符串过滤多 fasta 文件
Filter multi fasta file with matching ID string between specific symbols
我在这个问题上卡住了一段时间,尽管我发现了一组不同的类似问题 none 完全符合我的问题或解决了问题。所以这是交易:
我有一个 input.fasta
,格式如下:
>sp|O42363|APOA1_DANRE Apolipoprotein A-I OS=Danio rerio OX=7955 GN=apoa1 PE=2 SV=1
MKFVALALTLLLALGSQANLFQADAPTQLEHYKAAALVYLNQVKDQAEKALDNLDGTDYEQYKLQLSESLTKLQEYAQTTSQALTPYAETISTQLMENTKQLRERVMTDVEDLRSKLEPHRAELYTALQKHIDEYREKLEPVFQEYSALNRQNAEQLRAKLEPLMDDIRKAFESNIEETKSKVVPMVEAVRTKLTERLEDLRTMAAPYAEEYKEQLVKAVEEAREKIAPHTQDLQTRMEPYMENVRTTFAQMYETIAKAIQA
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWFDRYESGASMATKDWC
>sp|Q6TH01|C10_DANRE Protein C10 OS=Danio rerio OX=7955 GN=si:dkey-29f10.1 PE=2 SV=1
MASAPAQQPTLTVEQARVVLSEVIQAFSVPENAARMEEARESACNDMGKMLQLVLPVATQIQQEVIKAYGFNNEGEGVLKFARLVKMYETQDPEIAAMSVKLKSLLLPPLSTPPIGSGIPTS
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR
>sp|P0C7U5|C5AR1_DANRE C5a anaphylatoxin chemotactic receptor 1 OS=Danio rerio OX=7955 GN=c5ar1 PE=3 SV=1
MDDNNSDWTSYDFGNDTIPSPNEISLSHIGTRHWITLVCYGIVFLLGVPGNALVVWVTGFRMPNSVNAQWFLNLAIADLLCCLSLPILMVPLAQDQHWPFGALACKLFSGIFYMMMYCSVLLLVVISLDRFLLVTKPVWCQNNRQPRQARILCFIIWILGLLGSSPYFAHMEIQHHSETKTVCTGSYSSLGHAWAITIIRSFLFFLLPFLIICISHWKVYHMTSSGRRQRDKSSRTLRVILALVLGFFLCWTPLH
和一个 ids.txt
列表,格式如下:
Q90260
Q6PFL6
我想提取所有带有 header 的 fasta 序列,其中 ids.txt
的 ID 是 header 的元素。
我试过 grep -w -A 2 -Ff id_list.txt input.fasta --no-group-separator > out.fasta
但没用。
理想情况下,我想通过正则表达式检查以 >sp
开头的每一行的两个 |
之间的字符串是否与我的 idx.txt
中的任何 ID 匹配。如果是这样,将 header 和 fasta 存储在 out.fasta
.
所以 out.fasta
看起来像这样:
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWFDRYESGASMATKDWC
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR
我很确定这可以通过 awk
或 grep
来表达,但我是 bash 的新手,所以我现在很难过。
提前致谢! :)
使用 join
和 sort
:
join -t \| -2 2 -o 2.1,2.2,2.3 <(sort ids.txt) <(sort -t \| -k2 input.fasta)
假设 input.fasta 中没有额外的 |
字符并且输出行的顺序不重要。
与awk
:
awk -F'[|]' 'NR==FNR{ids[[=10=]];next} in ids' ids.txt input.fasta
解释:
我将文件 ids.txt 和 input.fasta 作为输入文件传递给 awk。顺序很重要。 -F'[|]'
将输入字段分隔符设置为 |
.
脚本:
# NR is the overall record (line) number
# FNR is the record (line) number in the current input file
NR==FNR { # True as long as we are reading the first input file
ids[[=11=]] # Create a key in ids for every id from ids.txt
next # Don't process further actions
}
# Because of the 'next' statement above, we'll reach this point only
# when reading the second input file (input.fasta)
in ids # Print the current line if the second field
# was found in the ids lookup
输出:
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
更新:原来你还想打印匹配项下方的内容。这可以这样实现:
BEGIN {
FS="[|]"
}
# NR is the overall record (line) number
# FNR is the record (line) number in the current input file
NR==FNR { # True as long as we are reading the first input file
ids[[=13=]] # Create a key in ids for every id from ids.txt
next # Don't process further actions
}
# Because of the 'next' statement above, we'll reach this point only
# when reading the second input file (input.fasta)
in ids {
# set or reset a variable p to 2 if the second field
# was found in the ids lookup
p = 2
}
# Decrement the variable p on every iteration and check if it
# is greater than 0 after that. If that's true, awk will print
# the current line
p--> 0
我在这个问题上卡住了一段时间,尽管我发现了一组不同的类似问题 none 完全符合我的问题或解决了问题。所以这是交易:
我有一个 input.fasta
,格式如下:
>sp|O42363|APOA1_DANRE Apolipoprotein A-I OS=Danio rerio OX=7955 GN=apoa1 PE=2 SV=1
MKFVALALTLLLALGSQANLFQADAPTQLEHYKAAALVYLNQVKDQAEKALDNLDGTDYEQYKLQLSESLTKLQEYAQTTSQALTPYAETISTQLMENTKQLRERVMTDVEDLRSKLEPHRAELYTALQKHIDEYREKLEPVFQEYSALNRQNAEQLRAKLEPLMDDIRKAFESNIEETKSKVVPMVEAVRTKLTERLEDLRTMAAPYAEEYKEQLVKAVEEAREKIAPHTQDLQTRMEPYMENVRTTFAQMYETIAKAIQA
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWFDRYESGASMATKDWC
>sp|Q6TH01|C10_DANRE Protein C10 OS=Danio rerio OX=7955 GN=si:dkey-29f10.1 PE=2 SV=1
MASAPAQQPTLTVEQARVVLSEVIQAFSVPENAARMEEARESACNDMGKMLQLVLPVATQIQQEVIKAYGFNNEGEGVLKFARLVKMYETQDPEIAAMSVKLKSLLLPPLSTPPIGSGIPTS
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR
>sp|P0C7U5|C5AR1_DANRE C5a anaphylatoxin chemotactic receptor 1 OS=Danio rerio OX=7955 GN=c5ar1 PE=3 SV=1
MDDNNSDWTSYDFGNDTIPSPNEISLSHIGTRHWITLVCYGIVFLLGVPGNALVVWVTGFRMPNSVNAQWFLNLAIADLLCCLSLPILMVPLAQDQHWPFGALACKLFSGIFYMMMYCSVLLLVVISLDRFLLVTKPVWCQNNRQPRQARILCFIIWILGLLGSSPYFAHMEIQHHSETKTVCTGSYSSLGHAWAITIIRSFLFFLLPFLIICISHWKVYHMTSSGRRQRDKSSRTLRVILALVLGFFLCWTPLH
和一个 ids.txt
列表,格式如下:
Q90260
Q6PFL6
我想提取所有带有 header 的 fasta 序列,其中 ids.txt
的 ID 是 header 的元素。
我试过 grep -w -A 2 -Ff id_list.txt input.fasta --no-group-separator > out.fasta
但没用。
理想情况下,我想通过正则表达式检查以 >sp
开头的每一行的两个 |
之间的字符串是否与我的 idx.txt
中的任何 ID 匹配。如果是这样,将 header 和 fasta 存储在 out.fasta
.
所以 out.fasta
看起来像这样:
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWFDRYESGASMATKDWC
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR
我很确定这可以通过 awk
或 grep
来表达,但我是 bash 的新手,所以我现在很难过。
提前致谢! :)
使用 join
和 sort
:
join -t \| -2 2 -o 2.1,2.2,2.3 <(sort ids.txt) <(sort -t \| -k2 input.fasta)
假设 input.fasta 中没有额外的 |
字符并且输出行的顺序不重要。
与awk
:
awk -F'[|]' 'NR==FNR{ids[[=10=]];next} in ids' ids.txt input.fasta
解释:
我将文件 ids.txt 和 input.fasta 作为输入文件传递给 awk。顺序很重要。 -F'[|]'
将输入字段分隔符设置为 |
.
脚本:
# NR is the overall record (line) number
# FNR is the record (line) number in the current input file
NR==FNR { # True as long as we are reading the first input file
ids[[=11=]] # Create a key in ids for every id from ids.txt
next # Don't process further actions
}
# Because of the 'next' statement above, we'll reach this point only
# when reading the second input file (input.fasta)
in ids # Print the current line if the second field
# was found in the ids lookup
输出:
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
更新:原来你还想打印匹配项下方的内容。这可以这样实现:
BEGIN {
FS="[|]"
}
# NR is the overall record (line) number
# FNR is the record (line) number in the current input file
NR==FNR { # True as long as we are reading the first input file
ids[[=13=]] # Create a key in ids for every id from ids.txt
next # Don't process further actions
}
# Because of the 'next' statement above, we'll reach this point only
# when reading the second input file (input.fasta)
in ids {
# set or reset a variable p to 2 if the second field
# was found in the ids lookup
p = 2
}
# Decrement the variable p on every iteration and check if it
# is greater than 0 after that. If that's true, awk will print
# the current line
p--> 0