Use sed to extract two pieces of text at once from a line

Question

OK, I found similar answers on SO, but my sed / grep / awk fu is so poor that I couldn't quite adapt them to my task. Which is, given this file "test.gff":

accn|CP014704   RefSeq  CDS 403 915 .   +   0   ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704   RefSeq  CDS 928 2334    .   +   0   ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704   RefSeq  CDS 31437   32681   .   +   0   ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704   RefSeq  CDS 2355    2585    .   +   0   ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein

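I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to either the end of the line or a semicolon (since, as you can see, one of the lines also has a "gene=" value).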

So I want something like this:

ID    product
AZ909_00020    transcriptional regulator
AZ909_00025    FAD/NAD(P)-binding oxidoreductase
AZ909_00145    gamma-glutamyl-phosphate reductase

This is as far as I've got:

printf "ID\tproduct\n"

sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff

Thanks!

Answer 1

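Try the following:

sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff

Compared to your attempt, I changed the way you match the product. Since we don't know whether the field ends with a ; or with EOL, we simply match as many non-; characters as possible. I also added a .* at the end to match any characters left over after the product. That way the substitution matches the whole line, and we can rewrite it completely.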
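To see why the negated class matters, here is a small sketch that runs a greedy pattern and the bounded pattern on one record from the question. The variable names (line, tab, greedy, bounded) are only for illustration, and the literal tab is an assumption made so the commands do not depend on GNU sed's \t escape.

```shell
# One sample record from the question (whitespace collapsed to single spaces).
line='accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA'

# A literal tab, so the replacement works in sed implementations without \t.
tab=$(printf '\t')

# Greedy groups: (.*) extends to the last semicolon that still lets the rest match,
# so the first group swallows the locus_tag attribute as well.
greedy=$(printf '%s\n' "$line" | sed -E "s/.*ID=(.*);.*product=(.*);.*/\1${tab}\2/")

# Bounded groups: [^;]* stops at the first semicolon or at end of line.
bounded=$(printf '%s\n' "$line" | sed "s/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1${tab}\2/")

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

Note that the greedy pattern also requires a semicolon after the product, so on the lines without a trailing gene= attribute it does not match at all and sed prints them unchanged.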

If you want something slightly more robust, here's a perl one-liner:

perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff

This extracts the two fields separately, each with its own regular expression. It works even if the fields appear in the opposite order.
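A quick way to check the order-independence claim is to feed the one-liner a record whose attributes are reversed. The record below is made up for this purpose (it is not in test.gff), and the variable names are hypothetical.

```shell
# Hypothetical record with the attributes in reverse order (not from test.gff).
reversed='accn|X RefSeq CDS 1 9 . + 0 product=hypothetical protein;locus_tag=AZ909_99999;ID=AZ909_99999'

# Each field is matched by its own regex, so the order in the line does not matter.
out=$(printf '%s\n' "$reversed" |
  perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"')

printf '%s\n' "$out"
```

The single-pass sed pattern, by contrast, expects ID= to come before product= and would leave such a line unchanged.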

Answer 2

If you have GNU awk (aka gawk) at your disposal, you could try the following:

Using awk

gawk 'BEGIN{printf "ID\tProduct%s",RS}
     {printf "%s\t%s%s",gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0),
      gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0),RS}
    ' test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

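As you may have noticed, the two gensub calls do the heavy lifting here.

In gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0), everything except what lies between ID= and the first semicolon that follows is removed from the record ($0). Note that gensub does not modify the record itself; it only returns the modified string, which is then printed.
In gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0), everything except what lies between product= and the first semicolon (or the end of the line) is removed in the same way.
Finally, we use expand -t to widen the tab stops and get nicely aligned output.
Since hardcoding \n is bad practice, I used the built-in record separator variable RS to print a newline after each record.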
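gensub is a gawk extension. Where only a POSIX awk is available, roughly the same extraction can be sketched with match() and substr(). The function names extract_id_product and attr are hypothetical helpers, not part of awk or of the answer above.

```shell
# Sketch: print ID and product for each record using only POSIX awk features.
extract_id_product() {
  awk '
    # attr(s, key): value of "key=..." up to the next semicolon, or "" if absent.
    function attr(s, key,    re) {
      re = "(^|[; \t])" key "=[^;]*"        # key must start the line or follow ; or blank
      if (!match(s, re)) return ""
      s = substr(s, RSTART, RLENGTH)        # e.g. ";product=hypothetical protein"
      sub(/^[^=]*=/, "", s)                 # drop everything through the first "="
      return s
    }
    BEGIN { print "ID\tProduct" }
    { print attr($0, "ID") "\t" attr($0, "product") }
  ' "$@"
}
```

It can be called as extract_id_product test.gff | expand -t20, or fed records on standard input.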

A sed solution using similar logic follows:

Using sed

printf "%-20s%s\n" "ID" "Product"
sed -E "s/^.*[[:blank:]]+ID=([^;]*);.*;product=([^;]*)[;]*.*$/\1\t\2/" 39322581 | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

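Given that a short and elegant perl solution has already been offered, you may also want to consider that one if perl is available to you.

Side note: using \n with printf decreases the portability of the script.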

Answer 3

Another in awk. We add ";" to the field separator (FS) list, strip the strings "ID=" and "product=", and print fields 9 and 10:

$ awk -F'([ \t\n]+|;)' 'BEGIN{print "ID" OFS "Product"}{gsub(/product=|ID=/,""); print $9,$10}' test.gff
ID Product
AZ909_00020 locus_tag=AZ909_00020
AZ909_00025 locus_tag=AZ909_00025
AZ909_00145 locus_tag=AZ909_00145
AZ909_00030 locus_tag=AZ909_00030
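In the output above the second column holds the locus_tag attribute, because with this sample field 10 is locus_tag rather than product, and splitting on spaces would also break a product such as "transcriptional regulator" apart. A possible variation is to pick attributes by name instead of by position. The sketch below assumes the real file is tab-separated, as GFF normally is, with the attributes in column 9. The function name id_product_by_name is hypothetical.

```shell
# Sketch: look attributes up by key inside column 9 of a tab-separated GFF file.
id_product_by_name() {
  awk -F'\t' '
    BEGIN { OFS = "\t"; print "ID", "Product" }
    {
      id = prod = ""
      n = split($9, kv, ";")                 # kv[i] looks like "key=value"
      for (i = 1; i <= n; i++) {
        eq  = index(kv[i], "=")              # position of the first "="
        key = substr(kv[i], 1, eq - 1)
        val = substr(kv[i], eq + 1)
        if (key == "ID") id = val
        else if (key == "product") prod = val
      }
      print id, prod
    }
  ' "$@"
}
```

Usage would be id_product_by_name test.gff. Attributes that are missing from a record simply come out as empty fields.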

Answer 4

The main problem with your regexp is the use of .* instead of [^;]*, since .* matches any character, whereas you only want to match non-semicolons. Try this:

$ sed -E 's/.*ID=([^;]+).*product=([^;]+).*/\1\t\2/' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

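Or, splitting on = and ; (the field numbers assume the attributes appear as ID, locus_tag, product, as in the sample above):

$ awk -F'[=;]' -v OFS='\t' '{print $2, $6}' file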
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

You can also easily pull the header values out of the data with awk:

$ awk -F'[=;]' -v OFS='\t' 'NR==1{sub(/.*[ \t]/,"",$1); print $1, $5} {print $2, $6}' file
ID      product
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein
