Reverse complement of a FASTA file
I am trying to get the reverse complement of the RNA sequences in a multi-FASTA file.
Input:
>cel-mir-39 MI0010 C elegans miR-39
UAUACCGAGAGCCCAGCUGAUUUCGUCUUGGUAAUAAGCUCGUCAUUGAGAUUAUCACCGGGUGUAAAUCAGCUUGGCUCAAAAAAAA
>cel-let-7 MI0001 C elegans let-7
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGGGGGGG
Expected output:
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
But this is what I get:
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
93-Rim snucele G 0100IM 93-rim-leg
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
7-tel snucele G 1000IM 7-tel-leg
My code:
#!/usr/bin/perl
use strict;
use warnings;
print "type in the path of the file\n";
my $file_name = <>;
chomp($file_name);
open (FASTA, $file_name) or die "error $!";
$/ = ">";
<FASTA>;
while (my $entry = <FASTA>){
    $entry = reverse $entry;
    $entry =~ tr/ACGUacgu/UGCAugca/;
    print "$entry \n";
}
close(FASTA);
How can I reverse only the sequence and not the header?
Thanks
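Reading records delimited by > is a good idea, since it gives you a whole block at a time. Here, however, the sequence lines must be joined and processed while the header is left alone, so the lines have to be told apart. Reading line by line is cleaner.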
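To see why the header comes out garbled, it helps to look at what one record contains when `$/` is set to ">": the header line, the sequence, and the trailing delimiter, all in one string. The sketch below uses a shortened, made-up sequence (`UAUACC`) with the first header from the question; `$record` and `$garbled` are illustrative names.

```perl
use strict;
use warnings;

# One record as the question's loop sees it with $/ = ">":
# header line, sequence line, and the trailing ">" delimiter.
my $record = "cel-mir-39 MI0010 C elegans miR-39\nUAUACC\n>";

# In scalar context, reverse flips the whole string character by character,
# so the header is reversed along with the sequence.
my $garbled = reverse $record;

# tr/// then "complements" every A, C, G, U it finds, including those in the header.
$garbled =~ tr/ACGUacgu/UGCAugca/;

print $garbled, "\n";
```

The last line printed is `93-Rim snucele G 0100IM 93-rim-leg`, the same mangled header that appears in the question's output.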
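A sequence line is distinctive: uppercase letters and nothing else. A blank line separates the records. Anything else is a header. The sequence is assembled by joining the lines that match its pattern, and once a blank line is found it is processed and printed.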
open (FASTA, $file_name) or die "error $!";
# sequence, built by joining lines =~ /^[A-Z]+$/
my $sequence = '';
while (my $entry = <FASTA>)
{
    if ($entry =~ m/^[A-Z]+$/) {
        # Assemble the sequence from separate lines
        chomp($entry);
        $sequence .= $entry;
    }
    elsif ($entry =~ m/^\s*$/) {
        # process and print the sequence and blank line, reset for next
        $sequence = reverse $sequence;
        $sequence =~ tr/ACGUacgu/UGCAugca/;
        print "$sequence\n";
        print "\n";
        $sequence = '';
    }
    else { # header
        print $entry;
    }
}
# Print the last sequence if the file didn't end with blank line
if (length $sequence) {
    $sequence = reverse $sequence;
    $sequence =~ tr/ACGUacgu/UGCAugca/;
    print "$sequence\n";
}
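^ and $ are anchors for the beginning and end of the string, so the regex that matches a sequence requires the whole line to consist strictly of uppercase letters. The other regex allows only optional whitespace, \s*, which identifies a blank line.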
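As a small illustration of how the two anchored patterns sort lines into the three cases, here is a sketch with a hypothetical helper, `classify`, applied to a few made-up lines. It is not part of the answer's code, only a way to try the regexes in isolation.

```perl
use strict;
use warnings;

# Classify one input line with the same two patterns used in the loop above.
sub classify {
    my ($line) = @_;
    return 'sequence' if $line =~ /^[A-Z]+$/;   # uppercase letters only
    return 'blank'    if $line =~ /^\s*$/;      # nothing but optional whitespace
    return 'header';                            # everything else
}

for my $line (">cel-let-7 MI0001 C elegans let-7\n", "UACACUGUGG\n", "\n", "UACA\r\n") {
    (my $shown = $line) =~ s/\s+\z//;           # trim line ending for display only
    printf "%-9s '%s'\n", classify($line), $shown;
}
```

Note that `$` also matches just before a trailing newline, which is why the lines can be tested before `chomp`. A sequence line with a Windows line ending (`\r\n`) or with lowercase letters does not match `/^[A-Z]+$/` and would fall through to the header branch, so such input may need to be normalized first.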
The sequence processing is copied from the question.
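The loop above relies on a blank line after each record. If the file has no blank lines between records, one possible variation is to flush the accumulated sequence whenever the next header line (starting with ">") appears. The following is a minimal sketch under that assumption; the subroutine names `revcomp` and `revcomp_fasta` and the two-record sample data are made up for illustration, and the input is read from an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Reverse-complement an RNA string (same reverse + tr/// as in the question).
sub revcomp {
    my ($seq) = @_;
    $seq = reverse $seq;
    $seq =~ tr/ACGUacgu/UGCAugca/;
    return $seq;
}

# Read FASTA from a filehandle and return the transformed text.
# A record is flushed when the next header (or end of input) is reached,
# so blank lines between records are optional.
sub revcomp_fasta {
    my ($fh) = @_;
    my ($out, $sequence) = ('', '');
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                 # strip newline, "\r", trailing blanks
        if ($line =~ /^>/) {                # header: flush previous record first
            $out .= revcomp($sequence) . "\n" if length $sequence;
            $sequence = '';
            $out .= "$line\n";
        }
        elsif (length $line) {              # sequence line: accumulate
            $sequence .= $line;
        }
    }
    $out .= revcomp($sequence) . "\n" if length $sequence;   # last record
    return $out;
}

my $data = ">rec1 demo\nUAUACC\nGAGA\n>rec2 demo\nacgu\n";
open my $in, '<', \$data or die "cannot open in-memory file: $!";
print revcomp_fasta($in);
close $in;
```

Because every non-header, non-empty line is treated as sequence, this version also accepts lowercase sequences, at the cost of not validating the alphabet.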
TXR solution:
@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@ (collect)
@rna
@ (until)
@ (end)
@ (output)
>@header
@(mapcar compl (reverse (cat-str rna)))
@ (end)
@(end)
Run:
$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
This variant wraps the output at 46 columns, like the original:
@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@ (collect)
@rna
@ (until)
@ (end)
@ (output)
>@header
@ (repeat :vars ((crna (tuples 46 (mapcar compl (reverse (cat-str rna)))))))
@crna
@ (end)
@ (end)
@(end)
Run:
$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGA
CGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAA
UAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
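For readers staying in Perl, the same 46-column wrapping can be approximated with `unpack` and a repeated group template. This is only a sketch, not a translation of the TXR code; `wrap_sequence` is a hypothetical helper and the sequence is made-up data.

```perl
use strict;
use warnings;

# Split a sequence into chunks of at most $width characters.
# "(A46)*" repeats a 46-character field until the string is used up;
# the last chunk is simply shorter.
sub wrap_sequence {
    my ($seq, $width) = @_;
    return unpack "(A$width)*", $seq;
}

my $seq = 'ACGU' x 25;                      # 100 characters of made-up data
print "$_\n" for wrap_sequence($seq, 46);   # lines of 46, 46 and 8 characters
```

The `A` format strips trailing whitespace from each chunk, which is harmless for a sequence that has already been joined into a single string without spaces.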
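Try the following.
First I split each record on newlines, storing the header in $header and the remaining lines in @ar.
Then the array is joined with newlines and stored in $entry. A substitution then removes the \n, >, \r and \s characters from the RNA sequence.
After that, reverse the string and apply the transliteration as usual. Finally, the print statement produces the output.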
open my $fh,"<","filename.text" or die"error opening $!";
$/ = ">";
<$fh>;
while (<$fh>)
{
    my ($header,@ar) = split("\n",$_);
    my $entry =join("\n",@ar);
    $entry=~s/\n|\r|>|\s//g;
    $entry = reverse $entry;
    $entry =~ tr/ACGUacgu/UGCAugca/;
    print ">$header\n$entry\n\n";
}
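The same record-at-a-time idea can be written a little more defensively. In the sketch below, `split` with a limit of 2 separates the header from the rest in one step, a single character class replaces the alternation (since `\s` already covers `\n` and `\r`), and `local` confines the change to `$/` to one block. The name `revcomp_record` and the sample data are assumptions for illustration, and the input comes from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Turn one ">"-delimited record (header line + sequence lines) into output text.
sub revcomp_record {
    my ($record) = @_;
    my ($header, $entry) = split /\n/, $record, 2;   # first line vs. the rest
    $entry = '' unless defined $entry;               # record with a header only
    $entry =~ s/[\s>]//g;         # drop newlines, "\r", blanks and the trailing ">"
    $entry = reverse $entry;
    $entry =~ tr/ACGUacgu/UGCAugca/;
    return ">$header\n$entry\n";
}

my $data = ">rec1 demo\nUAUACC\nGAGA\n>rec2 demo\nacgu\n";
open my $fh, '<', \$data or die "error opening: $!";
{
    local $/ = ">";    # restored automatically at the end of this block
    <$fh>;             # discard the chunk before the first ">"
    while (my $record = <$fh>) {
        print revcomp_record($record);
    }
}
close $fh;
```

Keeping the per-record work in a subroutine makes it easy to check on a single string before running it over a whole file.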