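grep/sed/awk parsing text file into one file with genes as rows and descriptions as columns
I want to use grep/awk/sed to parse a text file containing various gene descriptions.
Download the file (it is gzip-compressed, so decompress it, e.g. with gunzip, before running the commands below):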
wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
Sample text below:
WBGene00000001 aap-1 Y110A7A.10
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Automated description: Enables protein kinase binding activity. Involved in dauer
larval development; determination of adult lifespan; and insulin receptor signaling
pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and
neurons. Human ortholog(s) of this gene implicated in several diseases, including
Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency
36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit
3).
Gene class description: phosphoinositide kinase AdAPter subunit
=
WBGene00000002 aat-1 F27C8.1
Concise description: aat-1 encodes an amino acid transporter catalytic subunit;
when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1
is able to facilitate amino acid uptake and exchange, showing a relatively high
affinity for small and some large neutral amino acids; in addition, AAT-1 is able
to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus
expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface
of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly.
Automated description: Contributes to L-amino acid transmembrane transporter activity.
Involved in amino acid transmembrane transport. Located in plasma membrane. Part
of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons;
and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric
protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member
8).
Gene class description: Amino Acid Transporter
=
WBGene00000003 aat-2 F07C3.7
Concise description: aat-2 encodes a predicted amino acid transporter catalytic
subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,
AAT-2 is not able to induce amino acid uptake.
Automated description: Predicted to enable L-amino acid transmembrane transporter
activity. Predicted to be involved in L-alpha-amino acid transmembrane transport
and L-amino acid transport. Predicted to be located in membrane. Predicted to be
integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.
Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
Gene class description: Amino Acid Transporter
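For each gene, this text file contains a name line (e.g. WBGene00000004 aat-3 F52H2.2a), a Concise description:, an Automated description:, and a Gene class description:, with genes separated by a line holding an equals sign "=".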
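The layout described above can be read as a small state machine: a label line opens a section, and every following unlabeled line belongs to that section until the next label or the "=" separator. The sketch below tags each line with its section; the file names `sample_records.txt` and `tagged.txt` and the shortened description text are made up for illustration.

```shell
# Hypothetical two-record sample in the layout described above.
cat > sample_records.txt <<'EOF'
WBGene00000001 aap-1 Y110A7A.10
Concise description: first concise line
second concise line
Automated description: first automated line
Gene class description: class one
=
WBGene00000002 aat-1 F27C8.1
Concise description: only concise line
Automated description: auto line one
auto line two
Gene class description: class two
EOF

# A label line switches the current section; unlabeled lines inherit it.
awk '
$1 == "="                  { section = "separator" }
/^WBGene/                  { section = "gene" }
/^Concise description:/    { section = "concise" }
/^Automated description:/  { section = "automated" }
/^Gene class description:/ { section = "class" }
{ print "[" section "] " $0 }
' sample_records.txt > tagged.txt
cat tagged.txt
```

Wrapped continuation lines such as "second concise line" come out tagged `[concise]`, which is the property any single-pass solution has to rely on.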
I have been trying to parse this txt file, so I figured I would first extract each column and each row (gene) separately.
Here is my code:
#genes
grep "WBGene" c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_WBgenes.txt
#gene class description:
awk '/Gene class description:/' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_geneclass.txt
#concise description
awk '
/Concise description:/ { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }
flag { printf "%s%s",pfx,$0; pfx=" " } # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_concise.txt
#automated description
awk '
/Automated description:/ { flag=1; pfx="" }
/Gene class description:/ { flag=0; print "" }
flag { printf "%s%s",pfx,$0; pfx=" " } # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_automated.txt
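My question: is there a way to combine my code, or to use new code, to solve this better?
I want to extract each gene name, Concise description:, Automated description:, and Gene class description: into separate columns, with each row representing one gene.
I want to create a txt file in which each row is a gene and each column holds one of the descriptions.
Desired output: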
WBGene00000001 aap-1 Y110A7A.10 phosphoinositide kinase AdAPter subunit aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions. Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and neurons. Human ortholog(s) of this gene implicated in several diseases, including Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency 36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit 3).
WBGene00000002 aat-1 F27C8.1 Amino Acid Transporter aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1 is able to facilitate amino acid uptake and exchange, showing a relatively high affinity for small and some large neutral amino acids; in addition, AAT-1 is able to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly. Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons; and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
WBGene00000003 aat-2 F07C3.7 Amino Acid Transporter aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however, AAT-2 is not able to induce amino acid uptake. Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport and L-amino acid transport. Predicted to be located in membrane. Predicted to be integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria. Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
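If the four intermediate files each end up with exactly one line per gene, in the same order, one way to combine them is `paste`, which joins line N of every file with tabs. The following is only a sketch under that assumption: the sample data and the file names are invented, and the `sub()` calls that strip the labels are an addition to the original commands.

```shell
# Hypothetical two-record sample in the same layout as the WormBase file.
cat > sample.txt <<'EOF'
WBGene00000001 aap-1 Y110A7A.10
Concise description: concise one
continues here.
Automated description: automated one
continues too.
Gene class description: class one
=
WBGene00000002 aat-1 F27C8.1
Concise description: concise two.
Automated description: automated two.
Gene class description: class two
EOF

# One line per gene in each intermediate file, labels stripped with sub().
grep '^WBGene' sample.txt > genes.txt
awk 'sub(/^Gene class description: /, "")' sample.txt > geneclass.txt
awk '
/^Concise description:/   { sub(/^Concise description: /, ""); flag = 1; pfx = "" }
/^Automated description:/ { if (flag) print ""; flag = 0 }
flag { printf "%s%s", pfx, $0; pfx = " " }
' sample.txt > concise.txt
awk '
/^Automated description:/  { sub(/^Automated description: /, ""); flag = 1; pfx = "" }
/^Gene class description:/ { if (flag) print ""; flag = 0 }
flag { printf "%s%s", pfx, $0; pfx = " " }
' sample.txt > automated.txt

# paste joins the files column-wise, tab-separated by default.
paste genes.txt geneclass.txt concise.txt automated.txt > combined.tsv
cat combined.tsv
```

The weak point of this approach is alignment: if any gene lacks one of the sections, the files drift out of step and `paste` silently pairs descriptions with the wrong gene, which is why a single-pass awk script is usually the safer choice.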
I'm not sure I understand your question. But to get the result shown in your data frame picture, I would suggest something like:
awk '
BEGIN { COLSEP = "\t"; gcd = ""; ad = ""; cd = ""; flag = 0 }
/^WBGene/ { printf "\n%s%s%s%s%s", $1, COLSEP, $2, COLSEP, $3 }
/^Gene class description:/ { flag = 1; $1=$2=$3=""; }
/^Automated description:/ { flag = 2; $1=$2=""; }
/^Concise description:/ { flag = 3; $1=$2=""; }
/=/ { flag = 0; printf "%s%s%s%s%s", gcd, COLSEP, cd, COLSEP, ad; gcd = ""; ad = ""; cd = ""}
flag==1 { gcd = gcd $0 }
flag==2 { ad = ad $0 }
flag==3 { cd = cd $0 }
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt
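One way to drop a label such as "Automated description:" in awk is to blank its fields; another is to remove it with `sub()`. The two behave differently, as the sketch below illustrates on a single line taken from the sample. The output file names `blanked.txt` and `stripped.txt` are placeholders.

```shell
line='Automated description: Enables protein kinase binding activity.'

# Assigning to a field makes awk rebuild $0 with OFS between fields, so the
# two emptied fields leave two leading spaces behind (and any runs of
# whitespace inside the line would collapse to single spaces).
printf '%s\n' "$line" | awk '{ $1 = $2 = ""; print }' > blanked.txt

# sub() removes only the label and leaves the rest of the line untouched.
printf '%s\n' "$line" | awk '{ sub(/^Automated description: /, ""); print }' > stripped.txt

printf '[%s]\n[%s]\n' "$(cat blanked.txt)" "$(cat stripped.txt)"
```

With field blanking the column value starts with stray spaces, so `sub()` (or `substr()` with `index()`) tends to give cleaner tab-separated output.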
Assuming the output should be tab-delimited, one awk idea:
awk '
BEGIN { OFS="\t" }
function print_output() { if (baseID) print baseID,gene_name,trans_name,gene_desc,concise_desc,auto_desc; baseID="" }
$1 ~ /WBGene/ { baseID=$1; gene_name=$2; trans_name=$3 }
/^Gene class description:/ { gene_desc =substr($0, index($0,": ")+2) ; in_block="" }
# pfx=" " so that wrapped continuation lines are joined with a single space
/^Concise description:/ { concise_desc =substr($0, index($0,": ")+2) ; in_block="concise"; pfx=" "; next }
/^Automated description:/ { auto_desc =substr($0, index($0,": ")+2) ; in_block="auto" ; pfx=" "; next }
in_block { if (in_block == "concise")
               concise_desc = concise_desc pfx $0
           else
               auto_desc = auto_desc pfx $0
           pfx=" "
         }
$1 == "=" { print_output() }
END { print_output() }
' input.file
For the provided sample this generates:
WBGene00000001 aap-1 Y110A7A.10 phosphoinositide kinase AdAPter subunit aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions. Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and neurons. Human ortholog(s) of this gene implicated in several diseases, including Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency 36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit 3).
WBGene00000002 aat-1 F27C8.1 Amino Acid Transporter aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1 is able to facilitate amino acid uptake and exchange, showing a relatively high affinity for small and some large neutral amino acids; in addition, AAT-1 is able to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly. Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons; and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
WBGene00000003 aat-2 F07C3.7 Amino Acid Transporter aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however, AAT-2 is not able to induce amino acid uptake. Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport and L-amino acid transport. Predicted to be located in membrane. Predicted to be integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria. Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
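As a quick sanity check on a tab-separated result of this shape, one could count the fields in every row and report any row that does not have the expected six columns (three name fields plus three descriptions). This is only a sketch: `result.tsv` and `bad_rows.txt` are hypothetical file names, and the rows are shortened stand-ins for real output.

```shell
# Hypothetical output: two well-formed rows and one row missing its last column.
printf 'WBGene00000001\taap-1\tY110A7A.10\tclass\tconcise\tauto\n'  > result.tsv
printf 'WBGene00000002\taat-1\tF27C8.1\tclass\tconcise\tauto\n'    >> result.tsv
printf 'WBGene00000003\taat-2\tF07C3.7\tclass\tconcise\n'          >> result.tsv

# Report every row whose field count differs from the expected number.
awk -F'\t' -v want=6 '
NF != want { printf "line %d has %d fields: %s\n", NR, NF, $1 }
' result.tsv > bad_rows.txt
cat bad_rows.txt
```

An empty `bad_rows.txt` on the full file would suggest that every gene record produced all six columns.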