Extract matching pattern from input file and print to output file in Perl
I have a large number of input files from NCBI blastn in the following format:
<ul id="msgFrm" class=" msg">
Job Title: otu0
Database: rRNA_typestrains/prokaryotic_16S_ribosomal_RNA 16S ribosomal RNA (Bacteria and Archaea)
Query #1: otu0 Query ID: lcl|Query_16950 Length: 460
Sequences producing significant alignments:
Max Total Query E Per.
Description Score Score cover Value Ident Accession
Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence 839 839 100% 0.0 99.57 NR_041351.1
Alignments:
>Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence
Sequence ID: NR_041351.1 Length: 1493
Range 1: 341 to 800
Score:839 bits(454), Expect:0.0,
Identities:458/460(99%), Gaps:0/460(0%), Strand: Plus/Plus
Query 1 CCTACGGGGGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 60
|||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 341 CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 400
Query #2: otu1 Query ID: lcl|Query_16951 Length: 460
Sequences producing significant alignments:
Max Total Query E Per.
Description Score Score cover Value Ident Accession
Bacteroides vulgatus ATCC 8482 16S ribosomal RNA, partial... 811 811 100% 0.0 98.48 NR_074515.1
Bacteroides paurosaccharolyticus strain WK042 16S ribosomal RN... 673 673 100% 0.0 93.06 NR_112668.1
Alignments:
>Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence
Sequence ID: NR_041351.1 Length: 1493
Range 1: 341 to 800
Score:833 bits(451), Expect:0.0,
Identities:457/460(99%), Gaps:0/460(0%), Strand: Plus/Plus
Query 1 CCTACGGGGGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 60
|||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 341 CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 400
Query #3: otu2 Query ID: lcl|Query_16952 Length: 460
Sequences producing significant alignments:
Max Total Query E Per.
Description Score Score cover Value Ident Accession
Bacteroides sartorii JCM 16497 16S ribosomal RNA, partial... 684 684 100% 0.0 93.48 NR_113064.1
Bacteroides paurosaccharolyticus strain WK042 16S ribosomal RN... 678 678 100% 0.0 93.28 NR_112668.1
Alignments:
>Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence
Sequence ID: NR_041351.1 Length: 1493
Range 1: 341 to 800
Score:839 bits(454), Expect:0.0,
Identities:458/460(99%), Gaps:0/460(0%), Strand: Plus/Plus
Query 1 CCTACGGGTGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 60
|||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 341 CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGT 400
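...and so on.
Now, for each "Query #" in the file, I need to extract the block of information between "Query #" and "Alignments" and print it to a separate output file.
I tried the basic code below, but I am struggling with the Perl. Could someone help me solve this? Many thanks.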
#!/usr/bin/perl -w
use warnings;
open(GENBANK, "Bact_ncbi_output.txt") or die;
my $outfile = 'output.txt';
open OUT,'>',$outfile
or die "Could not open $outfile : $!";
$content = join("", <GENBANK>);
close GENBANK;
$content =~ /Query\s+([A-Z0-9_.]+)Alignments$/;
print OUT "$content\n";
Please try this:
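#!/usr/bin/perl
use strict;
use warnings;
# Slurp the whole BLAST report into a single string.
open(my $in, '<', 'txt.txt') or die "Could not open txt.txt : $!";
my $content = do { local $/; <$in> };
close $in;
my $outfile = 'output.txt';
open(my $out, '>', $outfile) or die "Could not open $outfile : $!";
# Capture each block from a "Query #" header line up to the next "Alignments:" line.
# Only one capture group is used, so each block appears once in @fetch.
my @fetch = $content =~ m{^(Query\s*\#.*?)^Alignments:}gms;
print $out join("\n", @fetch);
close $out;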
Please modify it as needed.
Thanks.
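Since the question mentions a large number of input files, a line-by-line variant that avoids reading a whole report into memory may also be worth considering. The following is only a sketch of that alternative, not the answer's method: the helper name `extract_query_blocks` and the shortened in-memory sample are made up for illustration, and it assumes every "Query #" header is eventually followed by an "Alignments:" line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): read a BLAST text report
# line by line and return one string per "Query #" block, ending each block
# just before the following "Alignments:" line.
sub extract_query_blocks {
    my ($fh) = @_;
    my @blocks;
    my $current;                        # undef while we are outside a block
    while (my $line = <$fh>) {
        if ($line =~ /^Query\s*#/) {    # header such as "Query #1: otu0 ..."
            $current = $line;
        }
        elsif ($line =~ /^Alignments:/) {
            push @blocks, $current if defined $current;
            undef $current;             # skip the alignment details that follow
        }
        elsif (defined $current) {
            $current .= $line;          # hit-table line inside the block
        }
    }
    return @blocks;
}

# Shortened in-memory sample so the sketch runs without an input file.
my $sample = <<'END';
Job Title: otu0
Query #1: otu0 Query ID: lcl|Query_16950 Length: 460
Description Score Score cover Value Ident Accession
Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence 839 839 100% 0.0 99.57 NR_041351.1
Alignments:
>Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence
Query 1 CCTACGGGGGGCAGCAGTGAGG 60
Query #2: otu1 Query ID: lcl|Query_16951 Length: 460
Description Score Score cover Value Ident Accession
Bacteroides vulgatus ATCC 8482 16S ribosomal RNA, partial... 811 811 100% 0.0 98.48 NR_074515.1
Alignments:
>Bacteroides dorei strain 175 16S ribosomal RNA, partial sequence
END

open(my $in, '<', \$sample) or die "Could not open in-memory sample: $!";
my @blocks = extract_query_blocks($in);
close $in;

# Blocks already end in a newline, so joining with "\n" leaves a blank line between them.
print join("\n", @blocks);
```

To run it on a real report, open the input file (for example Bact_ncbi_output.txt) instead of the in-memory sample and print to a filehandle opened on output.txt. Alignment rows such as "Query 1 CCTACG..." are not mistaken for headers, because the pattern requires a "#" right after "Query".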