awk: matching patterns in a row-based file and outputting them as CSV
I have a file containing records in this format:
LOCUS NG_029783 19834 bp DNA linear PRI 03-OCT-2014 DEFINITION Homo sapiens long intergenic non-protein coding RNA 1546
(LINC01546), RefSeqGene on chromosome X. ACCESSION NG_029783 VERSION NG_029783.1 KEYWORDS RefSeq; RefSeqGene. SOURCE Homo sapiens (human) ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo. COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
AC004616.1.
This sequence is a reference standard in the RefSeqGene project. PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP
1-19834 AC004616.1 8636-28469 FEATURES Location/Qualifiers
source 1..19834
/organism="Homo sapiens"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
/chromosome="X"
/map="Xp22.33"
variation 4
/replace="c"
/replace="t"
/db_xref="dbSNP:1205550"
variation 17
/replace="c"
/replace="t"
/db_xref="dbSNP:1205551"
gene 5001..5948
/gene="OR6K3"
/gene_synonym="OR1-18"
/note="olfactory receptor family 6 subfamily K member 3"
/db_xref="GeneID:391114"
/db_xref="HGNC:HGNC:15030"
mRNA 5001..5948
/gene="OR6K3"
/gene_synonym="OR1-18"
/product="olfactory receptor family 6 subfamily K member
//
LOCUS NG_032962 70171 bp DNA linear PRI 17-JUN-2016 DEFINITION Homo sapiens death domain containing 1 (DTHD1), RefSeqGene on
chromosome 4. ACCESSION NG_032962 VERSION NG_032962.1 KEYWORDS RefSeq; RefSeqGene. SOURCE Homo sapiens (human) ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo. COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AC104078.3.
This sequence is a reference standard in the RefSeqGene project.
Summary: This gene encodes a protein which contains a death domain.
Death domain-containing proteins function in signaling pathways and
formation of signaling complexes, as well as the apoptosis pathway.
Alternative splicing results in multiple transcript variants.
[provided by RefSeq, Oct 2012]. PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP
1-70171 AC104078.3 59395-129565 FEATURES Location/Qualifiers
source 1..70171
/organism="Homo sapiens"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
/chromosome="4"
/map="4p14"
gene 5001..129091
/gene="REEP1"
/gene_synonym="C2orf23; HMN5B; SPG31; Yip2a"
/note="receptor accessory protein 1"
/db_xref="GeneID:65055"
/db_xref="HGNC:HGNC:25786"
/db_xref="MIM:609139"
mRNA join(5001..5060,60842..60914,79043..79119,88270..88390,
91014..91127,110282..110459,125974..129091)
/gene="REEP1"
/gene_synonym="C2orf23; HMN5B; SPG31; Yip2a"
/product="receptor accessory protein 1, transcript variant
1"
/transcript_id="NM_001164730.1"
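I have been using this workflow:
Remove the extra whitespace:
gawk '{$1=$1}1' raw_file.text > temp_file.txt
Match the "Summary" content:
gawk '/Summary/,/\]/{print}' temp_file.txt > summary_temp.txt
Remove the newlines:
gawk 'BEGIN {RS=""}{gsub(/\n/,"",$0); print $0}' summary_temp.txt > summary.txt
I have a couple of questions.
First, how can I combine these 3 steps?
Second, how can I select one or more additional matches, for example '/gene="AP3B2"' (this needs to match the first "/gene" instance after "gene"), so that I can output the content in this form:
gene,summary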
$ cat tst.awk
BEGIN{RS="//"}
{
    match($0, /\/gene="([^"]+)"/, a)
    print a[1] ", ",
        gensub(/\s\s+/, "", "g", gensub(/.*Summary:\s([^\[]+).*/, "\\1", "g"))
}
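Explanation:
match($0, /\/gene="([^"]+)"/, a)
Finds the first /gene="..." in the record and stores the text captured by the parentheses in a[1] (the three-argument form of match() is a gawk extension). According to your question only the first occurrence is needed, which is a[1] (by the way, it is not AP3B2 in this sample).
gensub(/.*Summary:\s([^\[]+).*/, "\\1", "g")
Captures everything after "Summary: " up to the first "[".
That result still contains runs of spaces and newlines. Let's get rid of them:
gensub(/\s\s+/, "", "g", <<result of 1st gensub>>)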
EDIT: Not every record contains a "Summary"
In that case, change the script to:
$ cat tst.awk
BEGIN{RS="//"}
{
    match($0, /\/gene="([^"]+)"/, a)
    match($0, /Summary:\s([^\[]+)/, b)
    print a[1] ",",
        gensub(/\s\s+/, " ", "g", b[1])
}
Running the script on the input provided by the OP:
awk -f tst.awk tst.txt
OR6K3,
REEP1, This gene encodes a protein which contains a death domain.Death
domain-containing proteins function in signaling pathways andformation
of signaling complexes, as well as the apoptosis pathway.Alternative
splicing results in multiple transcript variants.
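For readers who do not have gawk, the same idea can be sketched with POSIX awk only. This is a minimal sketch under stated assumptions, not the script from the answer: it reads the file line by line instead of setting RS, uses RSTART and RLENGTH in place of the three-argument match(), and assumes every record ends with a line that contains only //. As a further assumption, it wraps the summary in double quotes so that commas inside the text do not split the CSV field. The function name gene_summary_csv is hypothetical.

```shell
# Hypothetical helper: prints one gene,"summary" CSV line per record.
# Assumes each record is terminated by a line that contains only "//".
gene_summary_csv() {
    awk '
    # Print the record collected so far, then reset the state.
    function emit() {
        if (seen) {
            gsub(/[ \t]+/, " ", summary)      # collapse runs of blanks
            sub(/^ /, "", summary)
            sub(/ $/, "", summary)
            gsub(/"/, "\"\"", summary)        # CSV-escape embedded quotes
            print gene ",\"" summary "\""
        }
        gene = ""; summary = ""; seen = 0; in_summary = 0
    }
    /^[ \t]*\/\/[ \t]*$/ { emit(); next }     # end-of-record marker
    NF { seen = 1 }                           # record has some content
    # Keep only the first /gene="..." of the record.
    gene == "" && match($0, /\/gene="[^"]+"/) {
        gene = substr($0, RSTART + 7, RLENGTH - 8)
    }
    # The summary runs from "Summary:" up to the first "[".
    /Summary:/ { in_summary = 1; sub(/.*Summary:[ \t]*/, "") }
    in_summary {
        cut = index($0, "[")
        if (cut > 0) { summary = summary " " substr($0, 1, cut - 1); in_summary = 0 }
        else         { summary = summary " " $0 }
    }
    END { emit() }                            # last record without a closing marker
    ' "$@"
}
```

It could be called as gene_summary_csv tst.txt > summary.csv. Records without a "Summary" produce an empty quoted field, which keeps every output line at two columns.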