Adding numbers to similar names in Perl
I am trying to convert a BLAST file to GFF3 using Perl. I work in science, so I am still quite new to programming.
My current code is as follows:
use strict;
use warnings;
use diagnostics;

my $db   = "BLAST";
my $prog = "blastn";

open(my $inFile, '<', $ARGV[0]) || die "Could not open file '$ARGV[0]' $!";
open(my $outFile, '>', $ARGV[1]) || die "Could not open file '$ARGV[1]' $!";
print $outFile "##gff-version 3\n#\n#\n";

while (<$inFile>) {
    chomp;
    my ($qseqid, $sseqid, $pident, $length, $mismatch, $gaps, $qstart, $qend, $sstart, $send, $evalue, $bitscore) = split(/\t/);

    # Strand is derived from the orientation of the query coordinates.
    my $sign;
    if ($qstart < $qend) {
        $sign = "+";
    } elsif ($qstart > $qend) {
        $sign = "-";
    } else {
        die "Unexpected qstart and end";
    }

    # Trim surrounding whitespace from the last column.
    $bitscore =~ s/^\s+|\s+$//g;

    # For now the last column is just the subject ID.
    my $subid = $sseqid;
    print $outFile "$sseqid\t$db\t$prog\t$qstart\t$qend\t$bitscore\t$sign\t.\t$subid\n";
}
My output is:
scf_62525_290.contig_1 BLAST blastn 1 3954 7302 + . scf_62525_290.contig_1
scf_62525_290.contig_1 BLAST blastn 4178 6577 4433 + . scf_62525_290.contig_1
scf_62525_290.contig_1 BLAST blastn 3953 4114 300 + . scf_62525_290.contig_1
scf_62525_290.contig_1 BLAST blastn 4115 4178 119 + . scf_62525_290.contig_1
scf_62525_1067.contig_1 BLAST blastn 1 1665 3075 + . scf_62525_1067.contig_1
scf_62525_163.contig_1 BLAST blastn 7 357 612 + . scf_62525_163.contig_1
scf_62525_4028.contig_1 BLAST blastn 1 1321 2436 + . scf_62525_4028.contig_1
scf_62525_4028.contig_1 BLAST blastn 1319 2231 1687 + . scf_62525_4028.contig_1
scf_62525_4028.contig_1 BLAST blastn 1275 1321 87.9 + . scf_62525_4028.contig_1
I would like to change it to this output:
scf_62525_290.contig_1 BLAST blastn 1 3954 7302 + . scf_62525_290.contig_1.t1.d1
scf_62525_290.contig_1 BLAST blastn 4178 6577 4433 + . scf_62525_290.contig_1.t1.d2
scf_62525_290.contig_1 BLAST blastn 3953 4114 300 + . scf_62525_290.contig_1.t1.d3
scf_62525_290.contig_1 BLAST blastn 4115 4178 119 + . scf_62525_290.contig_1.t1.d4
scf_62525_1067.contig_1 BLAST blastn 1 1665 3075 + . scf_62525_1067.contig_1.t1.d1
scf_62525_163.contig_1 BLAST blastn 7 357 612 + . scf_62525_163.contig_1.t1.d1
scf_62525_4028.contig_1 BLAST blastn 1 1321 2436 + . scf_62525_4028.contig_1.t1.d1
scf_62525_4028.contig_1 BLAST blastn 1319 2231 1687 + . scf_62525_4028.contig_1.t1.d2
scf_62525_4028.contig_1 BLAST blastn 1275 1321 87.9 + . scf_62525_4028.contig_1.t1.d3
Is there a simple way to do this?
Thanks
Here is some sample input:
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 3954 0 0 1 3954 23690 27643 0.0 7302
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 2400 0 0 4178 6577 28076 30475 0.0 4433
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 162 0 0 3953 4114 27722 27883 1e-79 300
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 64 0 0 4115 4178 27957 28020 4e-25 119
Ppluv_s011067g00001.1 scf_62525_1067.contig_1 100.00 1665 0 0 1 1665 4944 6608 0.0 3075
Ppluv_s010163g00001.1 scf_62525_163.contig_1 97.77 359 0 8 7 357 797 439 8e-175 612
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 1321 0 0 1 1321 2322 1002 0.0 2436
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 913 0 0 1319 2231 924 12 0.0 1687
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 47 0 0 1275 1321 992 946 4e-16 87.9
Ppluv_s014028g00001.1 scf_62525_3545.contig_1 79.23 1343 241 38 1 1321 1 1327 0.0 902
Ppluv_s014028g00001.1 scf_62525_1712.contig_1 74.27 1951 403 99 340 2227 3076 4990 0.0 732
Ppluv_s014028g00001.1 scf_62525_817.contig_1 82.74 730 87 39 1378 2105 23175 22483 2e-174 614
Ppluv_s014028g00001.1 scf_62525_177.contig_1 76.37 804 178 12 1320 2117 29453 28656 1e-116 422
Ppluv_s014028g00001.1 scf_62525_177.contig_1 75.28 615 134 18 1326 1937 36037 35438 2e-73 278
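I would probably do this with a counter. From what I can see in your output, rows with the same subject ID are numbered incrementally and the rest of the suffix stays the same. Maybe you could do something like this: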
use strict;
use warnings;
use diagnostics;

my $db   = "BLAST";
my $prog = "blastn";

open(my $inFile, '<', $ARGV[0]) || die "Could not open file '$ARGV[0]' $!";
open(my $outFile, '>', $ARGV[1]) || die "Could not open file '$ARGV[1]' $!";
print $outFile "##gff-version 3\n#\n#\n";

my $cnt = 0;     # running number within the current subject ID
my $tmp = "";    # subject ID seen on the previous row

while (<$inFile>) {
    chomp;
    my ($qseqid, $sseqid, $pident, $length, $mismatch, $gaps, $qstart, $qend, $sstart, $send, $evalue, $bitscore) = split(/\t/);

    # Keep counting while the subject ID repeats, restart at 1 when it changes.
    # This assumes rows for the same subject ID are adjacent in the input.
    if ($tmp eq $sseqid) {
        $cnt++;
    } else {
        $cnt = 1;
    }
    $tmp = $sseqid;

    # Build the suffix after the counter has been updated.
    my $fn = "${sseqid}.t1.d${cnt}";

    my $sign;
    if ($qstart < $qend) {
        $sign = "+";
    } elsif ($qstart > $qend) {
        $sign = "-";
    } else {
        die "Unexpected qstart and end";
    }

    # Trim surrounding whitespace from the last column.
    $bitscore =~ s/^\s+|\s+$//g;
    print $outFile "$sseqid\t$db\t$prog\t$qstart\t$qend\t$bitscore\t$sign\t.\t$fn\n";
}
This is untested, but you get the idea.
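To try the counter idea without a BLAST file, the numbering can be pulled out into a small helper. This is only a sketch: `make_adjacent_namer` is a made-up name, the `.t1.d` suffix is taken from the desired output in the question, and it assumes that rows for the same subject ID sit next to each other.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that numbers consecutive
# occurrences of the same ID and restarts at 1 when the ID changes.
sub make_adjacent_namer {
    my ($prev, $cnt) = ('', 0);
    return sub {
        my ($id) = @_;
        $cnt  = ($id eq $prev) ? $cnt + 1 : 1;
        $prev = $id;
        return "${id}.t1.d${cnt}";
    };
}

my $name = make_adjacent_namer();
# The last ID is the same as the first two but not adjacent to them,
# so its numbering starts over at d1.
print $name->($_), "\n" for qw(a.contig_1 a.contig_1 b.contig_1 a.contig_1);
```

The last line of the demo shows the limitation: if the same subject ID comes back later in the file, this approach would hand out `d1` again.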
Use a hash of hashes to record how many times you have seen each ID pair:
use strict;
use warnings;

my $db   = "BLAST";
my $prog = "blastn";
my %unique;    # counts per pair of numeric ID parts

print "##gff-version 3\n#\n#\n";

while (<DATA>) {
    # Split on whitespace and pick out the columns that are needed.
    my @fields   = split;
    my $sseqid   = $fields[1];
    my $qstart   = $fields[6];
    my $qend     = $fields[7];
    my $bitscore = $fields[11];

    my $sign;
    if ($qstart < $qend) {
        $sign = "+";
    } elsif ($qstart > $qend) {
        $sign = "-";
    } else {
        die "Unexpected qstart and end";
    }

    # "scf_62525_290.contig_1" splits into scf, 62525, 290, contig, 1;
    # the counter is keyed on the two numeric parts (62525 and 290).
    my @id_parts = split(/[_.]/, $sseqid);
    my $sub_id   = ++$unique{$id_parts[1]}{$id_parts[2]};

    print join("\t", $sseqid, $db, $prog, $qstart, $qend, $bitscore, $sign, '.', "${sseqid}.t1.d${sub_id}"), "\n";
}

__DATA__
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 3954 0 0 1 3954 23690 27643 0.0 7302
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 2400 0 0 4178 6577 28076 30475 0.0 4433
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 162 0 0 3953 4114 27722 27883 1e-79 300
Ppluv_s010290g00001.1 scf_62525_290.contig_1 100.00 64 0 0 4115 4178 27957 28020 4e-25 119
Ppluv_s011067g00001.1 scf_62525_1067.contig_1 100.00 1665 0 0 1 1665 4944 6608 0.0 3075
Ppluv_s010163g00001.1 scf_62525_163.contig_1 97.77 359 0 8 7 357 797 439 8e-175 612
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 1321 0 0 1 1321 2322 1002 0.0 2436
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 913 0 0 1319 2231 924 12 0.0 1687
Ppluv_s014028g00001.1 scf_62525_4028.contig_1 100.00 47 0 0 1275 1321 992 946 4e-16 87.9
Ppluv_s014028g00001.1 scf_62525_3545.contig_1 79.23 1343 241 38 1 1321 1 1327 0.0 902
Ppluv_s014028g00001.1 scf_62525_1712.contig_1 74.27 1951 403 99 340 2227 3076 4990 0.0 732
Ppluv_s014028g00001.1 scf_62525_817.contig_1 82.74 730 87 39 1378 2105 23175 22483 2e-174 614
Ppluv_s014028g00001.1 scf_62525_177.contig_1 76.37 804 178 12 1320 2117 29453 28656 1e-116 422
Ppluv_s014028g00001.1 scf_62525_177.contig_1 75.28 615 134 18 1326 1937 36037 35438 2e-73 278
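If the whole subject ID should be the grouping key, a single-level hash might be enough. A minimal sketch, assuming the `.t1.d` suffix from the question; `next_feature_id` and `%seen` are illustrative names only. It does not depend on the order of the rows, and IDs that differ only in the part after the dot (for example `contig_1` and `contig_2`) get separate counters, whereas keying on the two numeric parts would make them share one.

```perl
use strict;
use warnings;

my %seen;    # full subject ID => number of hits seen so far

# Illustrative helper: appends a per-ID running number to the subject ID.
sub next_feature_id {
    my ($sseqid) = @_;
    my $n = ++$seen{$sseqid};
    return "${sseqid}.t1.d${n}";
}

# Small in-memory demo: the third ID repeats the first one,
# the fourth differs only in the contig suffix.
for my $id (qw(scf_62525_290.contig_1 scf_62525_177.contig_1
               scf_62525_290.contig_1 scf_62525_290.contig_2)) {
    print next_feature_id($id), "\n";
}
```

In the loop over the BLAST rows, the last GFF3 column would then simply be `next_feature_id($sseqid)`.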