How to optimise code that removes stopwords from a text in Perl

Question

I have code that reads a text file and a file containing a list of stopwords, but it takes a long time to run. How can I optimise it?

#!/usr/bin/perl

use strict;
use warnings;
print "choose the name of the result file\n";
my $fic = <STDIN>;

open( FIC1, ">$fic" );

my @stops;
my @file;

use File::Copy;

open( STOPWORD, "C:\ats\stop-ats" ) or die "Can't Open: $!\n";

@stops = <STOPWORD>;
while (<STOPWORD>)    #read each line into $_
{
    chomp @stops;     # Remove newline from $_
    push @stops, $_;  # add the line to @triggers
}



close STOPWORD;

open( FILE, "C:\ats\ats" ) or die "Cannot open FILE";

while (<FILE>) {
    $line = $_;

    #print  $line;
    my @words = split( /\s/, $line );
    foreach $word (@words) {
        chomp($word);
        foreach $wor (@stops) {
            chomp($wor);
            if ( $word eq $wor ) {

                #print   "$wor\n";
                $word = '';

            }
        }

        print FIC1 $word;
        print FIC1 " ";

    }
    print FIC1 "\n";
}
exit 0;

The code takes a long time to process a text file. How can I optimise it?

Answer 1

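The main reason your code is slow is that it loops over the whole stopword array for every word in the input. The standard approach here is to store the stopwords in a hash instead of an array, so that checking a word is a single lookup.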
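A minimal sketch of the hash-based lookup, using a hypothetical in-memory word list in place of the files from the question (the names `@stop_list`, `@words` and `@kept` are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the contents of the stopword file.
my @stop_list = qw(the a of and);

# Build the lookup hash once: every stopword becomes a key.
my %stops = map { $_ => 1 } @stop_list;

# Each membership test is now one hash lookup instead of a scan of @stop_list.
my @words = qw(the cat and the hat);
my @kept  = grep { !$stops{$_} } @words;

print "@kept\n";    # prints "cat hat"
```

With an array, the work grows with (number of words) x (number of stopwords); with a hash it grows roughly with the number of words alone.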

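Also, it is clearer to chomp the whole array once, when you are sure no new elements will be added to it, instead of chomping its elements again and again inside the loops.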
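As an illustration, `chomp` accepts a list, so one call cleans every element. The sketch below reads from an in-memory filehandle with made-up contents so that it runs without the original files:

```perl
use strict;
use warnings;

# Hypothetical file contents, opened as an in-memory filehandle.
my $data = "the\na\nof\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @stops = <$fh>;    # list context: reads every line at once
close $fh;

chomp @stops;         # one call strips the trailing newline from every element

print join(',', @stops), "\n";    # prints "the,a,of"
```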

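As noted in the comments, the while (<STOPWORD>) loop never executes, because you have already exhausted the filehandle by reading it in list context on the previous line.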
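The effect can be reproduced in isolation. This sketch, again using an in-memory filehandle with made-up contents, counts how many times the loop body runs after a list-context read:

```perl
use strict;
use warnings;

# Hypothetical file contents, opened as an in-memory filehandle.
my $data = "the\na\nof\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @all = <$fh>;    # list context consumes the whole file

my $loop_runs = 0;
while ( my $line = <$fh> ) {    # the handle is already at end of file
    $loop_runs++;
}
close $fh;

print scalar(@all), " lines read, loop body ran $loop_runs times\n";
# prints "3 lines read, loop body ran 0 times"
```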

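You have not provided any sample input. If you want to exclude stopwords from a file that contains one word per entry, this is enough. If you want to process real text, you will have to do more work to find the occurrences of stopwords: they can appear in different case, and they are separated not only by whitespace but also by punctuation.

You can start from here: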

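#!/usr/bin/perl
use warnings;
use strict;

# Load the stopwords into a hash so that each lookup is a single operation.
open my $STOP, '<', 'stop-ats' or die "Can't Open: $!\n";
my %stops;
while (<$STOP>) {
    chomp;
    $stops{$_} = 1;
}
close $STOP;

open my $TEXT, '<', 'ats' or die "Cannot open FILE: $!";
while (<$TEXT>) {
    # The capturing group makes split return the words as well as the
    # text between them, so punctuation and spacing are preserved.
    my @words = split /([[:alpha:]]+)/;
    for my $word (@words) {
        # The lookup is lowercased; this assumes the stopword file
        # itself lists the words in lowercase.
        print $word unless $stops{ lc $word };
    }
}
close $TEXT;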
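
To see what the `split` with a capturing group does to a line, here is a small self-contained sketch. The stopwords and the sample sentence are made up for illustration, and the output is collected in a hypothetical `$out` variable instead of being printed piece by piece:

```perl
use strict;
use warnings;

# Hypothetical stopwords and input line.
my %stops = map { $_ => 1 } qw(the of);
my $line  = "The end of the road.";

# Because of the capturing parentheses, split returns the alphabetic runs
# (the words) interleaved with whatever lies between them.
my @pieces = split /([[:alpha:]]+)/, $line;

# Keep every piece whose lowercased form is not a stopword.
my $out = join '', grep { !$stops{ lc $_ } } @pieces;

print "[$out]\n";    # prints "[ end   road.]"
```

Note that the removed words leave their surrounding spaces behind. If that matters, collapsing runs of whitespace afterwards (for example with `s/\s+/ /g`) is one possible follow-up.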


algorithm

optimization

perl

stop-words