Eliminate empty files in a subroutine in Perl

Question

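I want to add code to the following script so that it deletes empty output files.

The script converts a single fastq file, or all fastq files in a folder, to fasta format; each output fasta file keeps the same name as its fastq file. The script has an option to exclude all sequences that contain a given number of consecutive N's (e.g. NNNNNNNNNNNNNNNNNATAGTGAAGAATGCGACGTACAGGATCATCTA). I added this option because some sequences consist of nothing but N's. For example, with -n 15 it excludes every sequence that has a run of 15 or more N's. Up to this point the code works well, but it produces an empty file for any fastq file in which every sequence was excluded. I want to delete all empty files (those with no sequences) and add a counter that reports how many files were deleted because they were empty.

Code: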

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

my ($infile, $file_name, $file_format, $N_repeat, $help, $help_descp,
    $options, $options_descrp, $nofile, $new_file, $count);

my $fastq_extension = "\.fastq";

GetOptions (
    'in=s'      => \$infile,
    'N|n=i'     => \$N_repeat,
    'h|help'    => \$help,
    'op'        => \$options
);

 # Help

 $help_descp =(qq(              
              Usage:
              fastQF -in fastq_folder/ -n 15
                      or
              fastQF -in file.fastq -n 15
              ));

 $options_descrp =(qq(

            -in      infile.fastq or fastq_folder/                  required
            -n       exclude sequences with more than N repeat      optional
            -h       Help description                               optional
            -op      option section                                 optional
                   ));

 $nofile =(qq(
            ERROR:  "No File or Folder Were Chosen !"

                Usage:
                    fastQF -in folder/

                Or See -help or -op section
           ));

 # Check Files 

    if ($help){
        print "$help_descp\n";
        exit;
    }
    elsif ($options){
        print "$options_descrp\n";
        exit;
    }

    elsif (!$infile){
        print "$nofile\n";
        exit;
    }


 #Subroutine to convert from fastq to fasta

    sub fastq_fasta {

        my $file = shift;
        ($file_name = $file) =~ s/(.*)$fastq_extension.*/$1/;

# eliminate old files 

        my $oldfiles= $file_name.".fasta";

        if ($oldfiles){
            unlink $oldfiles;
        }

        open LINE,    '<',   $file             or die "can't read or open $file\n";
        open OUTFILE, '>>', "$file_name.fasta" or die "can't write $file_name\n";

        while (
            defined(my $head    = <LINE>)       &&
            defined(my $seq     = <LINE>)       &&
            defined(my $qhead   = <LINE>)       &&
            defined(my $quality = <LINE>)
        ) {
                substr($head, 0, 1, '>');


                if (!$N_repeat){
                    print OUTFILE $head, $seq;


                }

                elsif ($N_repeat){

                        my $number_n=$N_repeat-1;

                    if ($seq=~ m/(n){$number_n}/ig){
                        next;
                    }
                    else{
                        print OUTFILE $head, $seq;
                    }
                }
        }

        close OUTFILE;
        close LINE;
    }

 # execute the subroutine to extract the sequences

    if (-f $infile) {           # -f is true for a plain file (not a folder)
        fastq_fasta($infile);
    }
    else {
        foreach my $file (glob("$infile/*.fastq")) {
        fastq_fasta($file);
        }
    }

 exit;

I tried using the following code outside the subroutine (just before exit), but it only works for the last file:

$new_file =$file_name.".fasta";
        foreach ($new_file){

            if (-z $new_file){
                $count++;
                if ($count==1){
                    print "\n\"The choosen File present not sequences\"\n";
                    print " \"or was excluded due to -n $N_repeat\"\n\n";

                }
                elsif ($count >=1){
                    print "\n\"$count Files present not sequences\"\n";
                    print " \" or were excluded due to -n $N_repeat\"\n\n";

                }

                unlink $new_file;
            }
        }

I also tried something similar inside the subroutine, but that code doesn't work either!

Any suggestions?

Thanks a lot!

Answer 1

The easiest way is probably to add a counter to the subroutine that tracks the number of sequences written to the output file:

sub fastq_fasta {
    my $counter1 = 0;
    my $file = shift;
    ($file_name = $file) =~ s/(.*)$fastq_extension.*/$1/;

# eliminate old files 

    my $oldfiles= $file_name.".fasta";

    if ($oldfiles){
        unlink $oldfiles;
    }

    open LINE,    '<',   $file             or die "can't read or open $file\n";
    open OUTFILE, '>>', "$file_name.fasta" or die "can't write $file_name\n";

    while (
        defined(my $head    = <LINE>)       &&
        defined(my $seq     = <LINE>)       &&
        defined(my $qhead   = <LINE>)       &&
        defined(my $quality = <LINE>)
    ) {
            $counter1 ++;
            substr($head, 0, 1, '>');


            if (!$N_repeat){
                print OUTFILE $head, $seq;


            }

            elsif ($N_repeat){

                    my $number_n=$N_repeat-1;

                if ($seq=~ m/(n){$number_n}/ig){
                    $counter1 --;
                    next;
                }
                else{
                    print OUTFILE $head, $seq;
                }
            }
    }

    close OUTFILE;
    close LINE;
    return $counter1;
}

You can then delete the file when the returned count is zero:

if (-f $infile) {           # -f is true for a plain file (not a folder)
    fastq_fasta($infile);
}
else {
    foreach my $file (glob("$infile/*.fastq")) {
        if (fastq_fasta($file) == 0) { 
            # turn name.fastq into name.fasta before deleting the empty output
            $file =~ s/(.*)$fastq_extension.*/$1.fasta/;
            unlink $file; 
        }
    }
}
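
The question also asks for a count of the deleted files. A minimal, self-contained sketch of that idea follows. It is not the asker's script: the sub name fastq_to_fasta, the variable $removed, and the two sample fastq files written to a temporary folder are made up for illustration, and the threshold of 15 is taken from the -n 15 example in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: convert one FASTQ file to FASTA, skipping reads that
# contain a run of at least $n_repeat N's. Returns the number of reads written.
sub fastq_to_fasta {
    my ($in, $out, $n_repeat) = @_;
    my $written = 0;

    open my $in_fh,  '<', $in  or die "can't read $in: $!\n";
    open my $out_fh, '>', $out or die "can't write $out: $!\n";

    while (
        defined(my $head    = <$in_fh>) &&
        defined(my $seq     = <$in_fh>) &&
        defined(my $qhead   = <$in_fh>) &&
        defined(my $quality = <$in_fh>)
    ) {
        # $n_repeat N's in a row anywhere means "at least that many".
        next if $n_repeat && $seq =~ /N{$n_repeat}/i;
        substr($head, 0, 1, '>');       # '@' header becomes '>' header
        print {$out_fh} $head, $seq;
        $written++;
    }

    close $out_fh or die "can't close $out: $!\n";
    close $in_fh;
    return $written;
}

# Made-up sample data: one normal read and one read made only of N's.
my $dir   = tempdir(CLEANUP => 1);
my %reads = (
    'good.fastq'  => "\@r1\nACGTACGT\n+\nIIIIIIII\n",
    'all_n.fastq' => "\@r2\nNNNNNNNNNNNNNNNNNNNN\n+\nIIIIIIIIIIIIIIIIIIII\n",
);
for my $name (keys %reads) {
    open my $fh, '>', "$dir/$name" or die "can't write $dir/$name: $!\n";
    print {$fh} $reads{$name};
    close $fh;
}

# Convert every file; delete and count the outputs that received no reads.
my $removed = 0;
for my $fastq (sort glob("$dir/*.fastq")) {
    (my $fasta = $fastq) =~ s/\.fastq$/.fasta/;
    if (fastq_to_fasta($fastq, $fasta, 15) == 0) {
        unlink $fasta or die "can't delete $fasta: $!\n";
        $removed++;
    }
}
print "$removed empty file(s) removed\n";
```

Keeping the counter in the calling loop rather than in a global means it accumulates across all files, which is what the attempt in the question was missing.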

Answer 2

You should check, at the end of your fastq_fasta subroutine, whether anything was actually written to your new file. Just put your code after the close OUTFILE statement:

close OUTFILE;
close LINE;

my $outfile = $file_name.".fasta";
if (-z $outfile)
{
   unlink $outfile or die "Error while deleting '$outfile': $!";
}

It would also be better to add die/warn statements to the other unlink lines. The empty files should then be deleted.
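
As a rough sketch of how this check could be wrapped up and combined with the count requested in the question: the helper name remove_if_empty and the two temporary files are invented for the example. It uses the low-precedence `or`, because with `||` the die would bind to the filename rather than to the return value of unlink and would never fire.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: delete $path if it exists and has zero size.
# Returns 1 if the file was deleted, 0 otherwise.
sub remove_if_empty {
    my ($path) = @_;
    return 0 unless -z $path;   # -z: file exists and is empty
    unlink $path or die "Error while deleting '$path': $!";
    return 1;
}

# Two made-up output files: one left empty, one with a single record.
my ($fh1, $empty) = tempfile();
close $fh1;
my ($fh2, $full) = tempfile();
print {$fh2} ">r1\nACGT\n";
close $fh2;

# Sum the return values to count how many files were removed.
my $count = 0;
$count += remove_if_empty($_) for ($empty, $full);
print "$count empty file(s) removed\n";

unlink $full;                   # clean up the non-empty sample file
```

Calling such a helper right after close OUTFILE and adding its return value to a counter declared outside the subroutine would give the total for a whole folder.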

If you are not tied to Perl and are allowed to use sed and a bash loop, here is another possible solution:

for i in *.fastq
do
   out=$(dirname "$i")/$(basename "$i" .fastq).fasta
   sed -n '1~4{s/^@/>/;N;p}' "$i" > "$out"
   if [ ! -s "$out" ]
   then
      echo "Empty output file $out"
      rm "$out"
   fi
done

Hope this helps!

Best,
Frank
