grep/sed/awk: parse a text file to print multiple rows after a pattern match and convert them to one row
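I want to use grep/awk/sed to parse a text file that contains several kinds of descriptions for many genes. I want each line to represent one gene description.
Right now I want to extract the automated and concise descriptions into individual txt files, with each line holding the description of a single gene.
Download the file: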
wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
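I have been able to extract the desired text into separate text files with the code below. However, I cannot get the text output as a single line.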
awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt
#do this for the next section automated description
awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt
#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt
Can anyone help?
Current text structure for one gene description:
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Desired text structure for one gene description:
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
Thank you, Jose.
A few small changes to OP's current awk code:
awk '
/Concise description:/ { flag=1; pfx="" }
/Automated description/ { flag=0; print "" } # close out the current printf line of output
flag { printf "%s%s",pfx,$0; pfx=" " } # assuming appended lines are separated by a single space
' file
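Note: I'm not sure I understand OP's current use of grep -v, since we don't have a set of sample input that demonstrates the need for grep -v ... ?
For the small sample provided, this generates: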
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
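To try this on the sample from the question, the following is a minimal sketch. It assumes a scratch file named sample.txt (a hypothetical name) and a made-up "Automated description" line with placeholder text, which is only there to end the block.

```shell
# Recreate the wrapped sample from the question in a scratch file.
# The final "Automated description" line is placeholder text (assumption).
cat > sample.txt <<'EOF'
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Automated description: placeholder text that should not be printed.
EOF

# Join the wrapped lines with single spaces; stop at the Automated line.
joined=$(awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }
flag                    { printf "%s%s", pfx, $0; pfx=" " }
' sample.txt)

printf '%s\n' "$joined"
```

Because flag is cleared before the printing rule runs, the "Automated description" line itself is never printed, so no grep -v is needed for this sample.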
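Assumptions:
- OP needs to parse the input file twice (once for each of 2 different blocks of text)
- the 2 different blocks of text do not overlap
- there may be multiple Concise or Automated blocks of text in the input file, with all input routed to one of two output files
We can combine OP's current 2x awk scripts into one, e.g.: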
awk '
function close_line() { if (outfile) print "" > outfile } # close out prior printf line of output?
/Concise description:/ { close_line()
outfile="WB283_concise.txt"
pfx=""
}
/Automated description:/ { close_line()
outfile="WB283_automated.txt"
pfx=""
}
/Gene class description/ { close_line()
outfile=""
}
outfile { printf "%s%s", pfx, $0 > outfile
pfx=" "
}
END { close_line() }
' file
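A quick way to check the combined script is to run it on a tiny input. The sketch below assumes a made-up two-gene file (genes.txt) with placeholder description text, and writes to hypothetical output names concise.out and automated.out rather than the WB283_* files.

```shell
# Two-gene sample; the description text is placeholder content (assumption).
cat > genes.txt <<'EOF'
WBGene00000001 aap-1 Y110A7A.10
Concise description: first concise line
continues here.
Automated description: first automated line
continues too.
Gene class description: not wanted
=
WBGene00000002 aat-1 F27C8.1
Concise description: second concise.
Automated description: second automated.
Gene class description: not wanted
=
EOF

awk '
# Terminate the line currently being built, if any.
function close_line() { if (outfile) print "" > outfile }

/Concise description:/   { close_line(); outfile="concise.out";   pfx="" }
/Automated description:/ { close_line(); outfile="automated.out"; pfx="" }
/Gene class description/ { close_line(); outfile="" }

# Append the current line to whichever output file is active.
outfile { printf "%s%s", pfx, $0 > outfile; pfx=" " }

END { close_line() }
' genes.txt

cat concise.out automated.out
```

Each output file ends up with one line per gene. awk truncates a file only the first time it is opened with `>` in a run, so later writes to the same name append.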
May I suggest a slightly modified solution (not exactly what was asked for, but it may contain useful ideas):
awk '
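/WBGene/ { printf("\n%s: ", $2) } # field number was lost in the source; $2 (gene name column) is assumed
/Concise description/ { flag = 1; $1=$2="" } # blank out the two words "Concise description:"
/=/ { flag = 0 }
/^.* description/ { flag = 0 }
flag { printf " %s", $0 }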
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt
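The idea is to filter out the string "Concise description", since that is what we are looking for anyway. The gene name is printed in the first column because many of the "concise descriptions" do not include the name.
The output format is one line per gene, starting with its name (plus a colon), followed by the "pure" concise description.
By the way: if you want to create a second output with the "Automated description" on each line, change the second awk line from /Concise description/ to /Automated description/.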
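As a sketch of that second variant, the snippet below runs the Automated version on a made-up sample. It assumes that each gene record starts with a WBGene line whose second whitespace-separated field is the gene name, and that a line containing "=" separates records; the sample text and the file name genes2.txt are placeholders, and the END rule is an addition to terminate the last line.

```shell
# Two-gene sample; the description text is placeholder content (assumption).
cat > genes2.txt <<'EOF'
WBGene00000001 aap-1 Y110A7A.10
Concise description: first concise line
continues here.
Automated description: first automated line
continues too.
Gene class description: not wanted
=
WBGene00000002 aat-1 F27C8.1
Concise description: second concise.
Automated description: second automated.
Gene class description: not wanted
=
EOF

out=$(awk '
/WBGene/                { printf("\n%s: ", $2) }     # start a new line with the gene name
/Automated description/ { flag = 1; $1 = $2 = "" }   # drop the two label words, rebuild $0
/=/                     { flag = 0 }                 # record separator ends the block
/^.* description/       { flag = 0 }                 # any other "... description" line ends it
flag                    { printf " %s", $0 }
END                     { print "" }                 # terminate the last line (added here)
' genes2.txt)

printf '%s\n' "$out"
```

The rule order matters: once $1 and $2 are blanked, the Automated line no longer matches /^.* description/, so flag stays set for it. The same trick would break if the rest of that line contained " description", or if a wrapped line contained "=", so treat this as a starting point rather than a robust parser.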