Filter (or 'cut') out column that begins with 'OS=abc'

Question

My .fasta file contains this repeating pattern.

>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa

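I want to keep only the lines containing '>' and then extract the string that comes after 'OS=' and before 'OX=' (for line 1 of the example: Ctenodactylus gundi).

The first part ('>') is easy: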

grep '>' my.fasta | cut -d " " -f 3 >> species.txt

The problem is that the number of fields before 'OS=' is not constant.

But the number of columns/fields between 'OS=' and 'OX=' is 2.

Answer 1

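IMHO awk is the more practical tool here, since it can handle both the regex condition and the printing in one command. Could you please try the following.

awk '/^>/ && match($0,/OS=.*OX=/){print substr($0,RSTART+3,RLENGTH-6)}' Input_file

The output will be as follows.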

Ctenodactylus gundi
Nostoc commune
Gallus gallus

Explanation: a detailed walkthrough of the code above.

awk '                                    ##Start the awk program here.
/^>/ && match($0,/OS=.*OX=/){            ##Check that the line starts with > AND matches the regex OS=.*OX= (from OS= up to OX= in the line); continue only if both conditions are TRUE.
  print substr($0,RSTART+3,RLENGTH-6)    ##Print the substring of the current line that starts at RSTART+3 (just after OS=) and is RLENGTH-6 characters long (the match minus OS= and OX=).
}
' Input_file                             ##Name of the input file.
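One detail of the substr arithmetic above: the text matched by OS=.*OX= includes the space in front of OX=, so a length of RLENGTH-6 keeps that space at the end of each name. A minimal sketch of a variant that drops it, assuming every header has exactly one space before OX= (the function name extract_species is hypothetical, not part of the original answer):

```shell
# Hypothetical helper: print the species name from each FASTA header line.
extract_species() {
  awk '/^>/ && match($0, /OS=.* OX=/) {
    # The match covers "OS=" (3 chars) + name + " OX=" (4 chars),
    # so the name starts at RSTART+3 and is RLENGTH-7 characters long.
    print substr($0, RSTART + 3, RLENGTH - 7)
  }' "$@"
}
```

It would be called as extract_species my.fasta.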

Answer 2

You can enable PCRE-based regex matching with the -P option and use lookaround patterns so that the match is only the text between OS= and OX=:

grep '>' my.fasta | grep -oP '(?<=OS=).*(?=OX=)'

Note that the -P option is only available in GNU grep and may not be available by default in some environments.

Answer 3

Using any awk in any shell on every UNIX box:

$ awk -F' O[SX]=' '/^>/{print $2}' file
Ctenodactylus gundi
Nostoc commune
Gallus gallus
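To see why the second field is the species name, it can help to print every field awk produces when the separator is the regex ' O[SX]='. The sketch below does only that; show_fields is a hypothetical name used for illustration.

```shell
# Hypothetical helper: number and print the fields of each header line
# when the field separator is " OS=" or " OX=".
show_fields() {
  awk -F' O[SX]=' '/^>/ {
    for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i
  }' "$@"
}
```

For the sample headers, field 1 is everything before " OS=", field 2 is the species name, and field 3 is everything after " OX=".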

Answer 4

A sed solution:

$ sed -nE '/>/ s/^.*OS=(.*) OX=.*$/\1/p' .fasta
Ctenodactylus gundi
Nostoc commune
Gallus gallus

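-n so that the pattern space is not printed unless requested; -E (extended regular expressions) so that we can write the subexpression with plain parentheses and refer back to it. The p flag of the s command means "print the pattern space".

The regex is meant to match the whole line, with the subexpression picking out the fragment we need to extract. I assumed there is exactly one space before OX, which must not appear in the output; adjust if/as needed.

This assumes that every line beginning with > has an OS= ... fragment immediately followed by an OX= ... fragment; if not, that condition can be added to the />/ filter before the s command. (By the way, can there be other fragments, say OT= ..., between OS=... and OX= ...?)

One question: wouldn't you want each output line to include some identifier (perhaps part of the "label" at the start of each header line)? You have the fragments you asked for, but do you know which record each one came from?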
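One way to check that assumption before relying on the output is to list the header lines that do not contain an OS= fragment followed by " OX=". This is only a sketch; headers_without_species is a hypothetical name.

```shell
# Hypothetical helper: print header lines that lack "OS=... OX=",
# i.e. the headers the extraction commands would silently skip.
headers_without_species() {
  awk '/^>/ && !/OS=.* OX=/' "$@"
}
```

If this prints nothing, every header in the file carries both fragments in the expected order.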
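If an identifier is wanted, a possible sketch is to print the accession next to each species name. It assumes the headers follow the >sp|ACCESSION|NAME layout seen in the sample and reuses the ' O[SX]=' field separator; species_with_id is a hypothetical name.

```shell
# Hypothetical helper: print "accession<TAB>species" for each header line.
species_with_id() {
  awk -F' O[SX]=' '/^>/ {
    split($1, parts, "|")      # parts[2] is the accession, e.g. P20855
    print parts[2] "\t" $2     # $2 is the text between " OS=" and " OX="
  }' "$@"
}
```

Each output line can then be traced back to its record by the accession in the first column.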
