grep/sed/awk: parse a text file to print multiple lines after a pattern match and convert them to one line

Question

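I want to use grep/awk/sed to parse a text file that contains several kinds of descriptions for many genes. I would like each line of the output to hold one gene description.

Now I want to extract the automated and concise descriptions into individual txt files, with each line holding the description of a single gene.

Download the file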

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz

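With the code below I have been able to extract the text I need into separate text files. However, I cannot get each description onto a single line.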

awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt

#do this for the next section automated description

awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt

#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt

Can anyone help?

Current text structure for 1 gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions.

Desired text structure for 1 gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.

Thank you, Jose.

Answer 1

A few small changes to OP's current awk code:

awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }                # close out the current printf line of output
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' file

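Note: I'm not sure I understand OP's current use of grep -v, since we don't have a set of sample input that demonstrates the need for it.

For the small sample provided this generates: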

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide  3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.
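The doubled spaces in this output come from the trailing space that each wrapped input line already carries. A minimal sketch, assuming the goal is the single-spaced line shown in the question: strip trailing whitespace from each line before appending it. The file name sample.txt and its contents are made up for illustration, and the END rule is an addition that terminates a block running to the end of the file.

```shell
# Hypothetical sample; note the trailing space after "phosphoinositide".
printf '%s\n' \
  'Concise description: aap-1 encodes the phosphoinositide ' \
  '3-kinase adaptor subunit.' \
  'Automated description: Predicted to enable kinase binding.' > sample.txt

joined=$(awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { if (flag) print ""; flag=0 }   # end the joined line only if one is open
flag                    { sub(/[ \t]+$/, "")             # drop trailing blanks from the current line
                          printf "%s%s", pfx, $0
                          pfx=" " }
END                     { if (flag) print "" }           # terminate a block that runs to end of file
' sample.txt)

echo "$joined"
```

If the file has Windows line endings, the character class in sub() would also need to include \r.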

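Assumptions:

OP currently parses the input file twice (once for each of the two text blocks)
the two text blocks do not overlap
the input file may contain multiple Concise or Automated text blocks, and all matching input is routed to one of the two output files

We can merge OP's current two awk scripts into one, e.g.: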

awk '
function close_line()    { if (outfile) print "" > outfile }      # close out prior printf line of output?

/Concise description:/   { close_line()
                           outfile="WB283_concise.txt"
                           pfx=""
                         }
/Automated description:/ { close_line()
                           outfile="WB283_automated.txt"
                           pfx=""
                         }
/Gene class description/ { close_line()
                           outfile=""
                         }
outfile                  { printf "%s%s", pfx, $0 > outfile
                           pfx=" "
                         }
END                      { close_line() }
' file
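One way to check the result is to confirm that each output file holds exactly one line per matching header in the input. The sketch below does this against a made-up miniature input (sample.txt, with invented description text), assuming every record ends with a Gene class description line as the script above expects. It writes to hypothetical demo_* files so the real output files are left alone.

```shell
# Hypothetical miniature input with two gene records; the text is invented.
cat > sample.txt <<'EOF'
Concise description: gene one does
something interesting.
Automated description: Predicted to
enable binding.
Gene class description: not known
Concise description: gene two.
Automated description: Expressed in
neurons.
Gene class description: not known
EOF

# Same logic as the combined script, condensed, writing to demo_* files.
awk '
function close_line() { if (outfile) print "" > outfile }
/Concise description:/   { close_line(); outfile="demo_concise.txt";   pfx="" }
/Automated description:/ { close_line(); outfile="demo_automated.txt"; pfx="" }
/Gene class description/ { close_line(); outfile="" }
outfile                  { printf "%s%s", pfx, $0 > outfile; pfx=" " }
END                      { close_line() }
' sample.txt

# Each header in the input should yield exactly one line in the matching output file.
headers=$(grep -c 'Concise description:' sample.txt)
lines=$(wc -l < demo_concise.txt | tr -d ' ')
echo "concise headers: $headers, output lines: $lines"
```

A mismatch between the two counts would suggest that some record lacks the terminating line the script relies on.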

Answer 2

May I suggest a slightly modified solution (not exactly what was asked for, but it may contain useful ideas):

awk '
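/WBGene/              { printf("\n%s: ", $2) }   # gene name, assumed to be the second field of the ID line
/Concise description/ { flag = 1; $1=$2="" }     # blank the two header words so the rule two lines down no longer matches
/=/                   { flag = 0 }
/^.* description/     { flag = 0 }
flag                  { printf " %s", $0 }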
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt

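The idea is to filter out the string "Concise description", since that is what we are looking for anyway. The gene name is printed in the first column because many "concise descriptions" do not include the name.

The output format is one line per gene, starting with its name (plus a colon), followed by the "pure" concise description.

By the way: if you want to create a second output with the "Automated description" on each line, change the second awk line from /Concise description/ to /Automated description/.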
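Rather than editing the script for each section, the section name could be passed in as an awk variable. This is a variation on the idea, not the script above verbatim: it removes the header with sub() instead of blanking fields. The record layout (an ID line starting with WBGene whose second field is the gene name, section headers at the start of a line, and a line starting with "=" closing each record) is an assumption, and the extract function, sample.txt, and the sample text are invented for illustration.

```shell
# Hypothetical record in the assumed layout: ID line, description sections, "=" separator.
cat > sample.txt <<'EOF'
WBGene00000001 aap-1 Y110A7A.10
Concise description: aap-1 encodes
an adaptor subunit.
Automated description: Predicted to
enable binding.
=
EOF

extract() {   # usage: extract SECTION FILE   (SECTION is e.g. Concise or Automated)
  awk -v section="$1" '
  /^WBGene/                  { if (n++) print ""; printf "%s:", $2 }   # one output line per gene
  /^[A-Za-z ]+ description:/ { flag = (index($0, section " description:") == 1)
                               sub(/^[^:]*: */, "") }                  # drop the header text itself
  /^=/                       { flag = 0 }                              # record separator ends any block
  flag                       { printf " %s", $0 }
  END                        { if (n) print "" }                       # terminate the last line
  ' "$2"
}

automated=$(extract Automated sample.txt)
echo "$automated"
```

Calling extract Concise sample.txt would produce the concise variant from the same function, so both output files can be generated without maintaining two scripts.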
