Perl regular expression to match '>'
This is how the data in my file is arranged:
>Contig1
TGGCACCTTCGACAGTTGCTCCCTCCTGGGTGGGGGCCGTCTGACCTCGCTGTACTCCT
>Contig2
GGGCCTTGGGAAGCGCAGGTGCCGAGAACTTGGCTAGAGCGGTAGACAATGCGGTTCGTG
AAAAGAGCAACTTTAAATACTTGTACGACCTCAACCAGCCAGTCAAAGAGAAAATCGAG
>NODE_105957_length_443_cov_1.000000
TCAGAAGTTAATGCAATCTGGTCCATTAAGTAAATGGGTATCATGGTACATAAACTAAAA
GCACAGAACATGGATTATTTTCCCAATTTTAACTTTCCTAACCATTTTTATCTCTCTCAA
TAACTTCCACAGTAGTTTTTATTCGTCTCAATAACTTTATTAAAAGGGATCCCTCTATCC
CCAGAATTCAGTAGCTGCATACGACTTTCCTGTCACTAGAGATCCCTCAGATGTCGGTAG
TGCATTCATCTTAAGTGATAAATCAAATGTTAGTCAAGTTAGGAAGTGAGAATTGATACA
GAATTTCTACTTCAATACTAGCTATCCCAAAATGGTCATTGACGATTTATTTTTTTCCTA
CCAGCATATTCTTTTCTAGTATTTCAGATCTAGTGACTCAGAACTAGGACAATCATAAAT
TTGAAGGGAACCTTAAGTCTTTTTTCATGCTGAGACTGCCAAG
>NODE_105950_length_95_cov_1.000000
TCAGGTCCTACTTCATTTGTAAGGAAAACTGACAGGTAATTCAGTGGGACAGAATACCAT
GTGAAGAGTTTCCTCTCACCTGAGAGGAGACTTTTTGATGATGATGATGATCAAT
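Could you suggest how to extract the sequences, i.e. only the lines that contain A, T, G and C, with a line break between each group of consecutive sequence lines? This is my code so far: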
#!/usr/bin/perl
print "Enter the first filename\n";
$filename = <>;
print "Enter the output file for ids\n";
$filename1 = <>;
print "Enter the output file for sequences\n";
$filename2 = <>;
my $first = ">";
open(FILE, $filename) or die "Could not read from $filename, program halting.";
open(FIL, '>', $filename1) or die "Could not read from $filename1, program halting.";
open(FILES, '>', $filename2) or die "Could not read from $filename2, program halting.";
while(my $line = <FILE>)
{
    if ($line =~ m//s)
    {
        print FILES $line, "\n";
    }
    if ($line =~ m/^>/)
    {
        print FIL $line;
    }
}
close FILE;
close FIL;
close FILES;
This is just a basic, simple Perl program to match a pattern. Any help is appreciated.
You can use this regular expression:
/^[ATGC]+$/gm
Demo here: https://regex101.com/r/rQ9gN4/2
If you want to extract
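As a minimal sketch of how that pattern could be applied from Perl: the sample string and variable names below are made up for illustration and are shortened from the question's data. The /m modifier lets ^ and $ anchor at every line of a multi-line string, and /g in list context returns every match.

```perl
use strict;
use warnings;

# Hypothetical sample data, shortened from the question's file.
my $data = ">Contig1\nTGGCACCT\n>Contig2\nGGGCCTTG\nAAAAGAGC\n";

# /m: ^ and $ match at line boundaries inside the string.
# /g in list context: return all matched lines at once.
my @seq_lines = $data =~ /^[ATGC]+$/gm;

print "$_\n" for @seq_lines;
```

When the file is read one line at a time, as in the question's while loop, neither modifier is needed: a plain `$line =~ /^[ATGC]+$/` is enough, because $ also matches just before a trailing newline.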
NODE_105957_length_443_cov_1.000000
NODE_105950_length_95_cov_1.000000
negate the character class in the regular expression above:
/^([^ATGC]+)$/gm
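One caveat with the negated class: it only matches header lines that happen to contain no uppercase A, T, G or C, so a header such as >Contig1 (which contains a C) is skipped. Since every header in the sample data starts with '>', anchoring on that character may be simpler. The following is a sketch under that assumption; the @lines array and @ids name are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample lines, shortened from the question's file.
my @lines = (
    ">Contig1",
    "TGGCACCT",
    ">NODE_105957_length_443_cov_1.000000",
    "TCAGAAGT",
);

my @ids;
for my $line (@lines) {
    # Capture everything after the leading '>' up to the first whitespace.
    push @ids, $1 if $line =~ /^>(\S+)/;
}

print "$_\n" for @ids;
```

This collects both Contig1 and NODE_105957_length_443_cov_1.000000, without the leading '>'.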
Try this:
#!/usr/bin/perl
# ALWAYS
use strict;
use warnings;
print "Enter the first filename\n";
chomp (my $filename = <>); # remove the line break
print "Enter the output file for ids\n";
chomp (my $filename1 = <>); # remove the line break
print "Enter the output file for sequences\n";
chomp (my $filename2 = <>); # remove the line break
# use three-arg open and show the reason when it fails
open(my $FILE, '<', $filename) or die "Unable to open '$filename', $!";
open(my $FILE1, '>', $filename1) or die "Unable to open '$filename1', $!";
open(my $FILE2, '>', $filename2) or die "Unable to open '$filename2', $!";
while(my $line = <$FILE>) {
    chomp($line); # remove line break
    if ($line =~ /^>/) {
        # header line: goes to the ids file
        print $FILE1 $line,"\n";
        # add a line break to filename2 unless we are at the first line.
        print $FILE2 "\n" unless $. < 2;
    }
    else {
        # sequence line: appended to the current sequence without a line break
        print $FILE2 $line;
    }
}
# terminate the last sequence with a line break
print $FILE2 "\n";
close $FILE;
close $FILE1;
close $FILE2;
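To see what this loop produces without touching real files, here is a small sketch that runs the same logic on in-memory filehandles (opening a reference to a scalar). The sample input and the names $input, $ids and $seqs are assumptions for illustration. It also writes a final newline after the loop so the last sequence is terminated.

```perl
use strict;
use warnings;

# Hypothetical input, shortened from the question's file.
my $input = ">Contig1\nTGGC\n>Contig2\nGGGC\nAAAA\n";
my ($ids, $seqs) = ('', '');

# In-memory filehandles: read from / write to scalars instead of files.
open(my $in,      '<', \$input) or die "Unable to open input, $!";
open(my $ids_out, '>', \$ids)   or die "Unable to open ids buffer, $!";
open(my $seq_out, '>', \$seqs)  or die "Unable to open sequence buffer, $!";

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) {
        print $ids_out "$line\n";
        # Start a new output line for each record except the first.
        print $seq_out "\n" unless $. < 2;
    }
    else {
        # Join the wrapped lines of one record into a single line.
        print $seq_out $line;
    }
}
print $seq_out "\n";    # terminate the last sequence
close $_ for $in, $ids_out, $seq_out;

print $seqs;
```

Each record's wrapped sequence lines end up joined on a single line, one line per header, so the two lines under >Contig2 are written as GGGCAAAA.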