Perl: comparing words in two files

Question

This is my current script, which tries to compare the words in file_all.txt against the words in file2.txt. It should print out any words in file_all that are not in file2.

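I need to format them as one word per line, but that is not the more pressing problem.

I am new to Perl... I know C and Python, but this is a bit trickier, and I know my variable assignments are off.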

 use strict;
 use warnings;

 my $file2 = "file_all.txt";   %I know my assignment here is wrong
 my $file1 = "file2.txt";

 open my $file2, '<', 'file2' or die "Couldn't open file2: $!";
 while ( my $line = <$file2> ) {
     ++$file2{$line};
     }

 open my $file1, '<', 'file1' or die "Couldn't open file1: $!";
 while ( my $line = <$file1> ) {
     print $line unless $file2{$line};
     }

Edit: Oh, it should ignore case... so Pie is treated the same as PIE when comparing. It should also strip apostrophes.

These are the errors I get:

"my" variable $file2 masks earlier declaration in same scope at absent.pl line 9.
"my" variable $file1 masks earlier declaration in same scope at absent.pl line 14.
Global symbol "%file2" requires explicit package name at absent.pl line 11.
Global symbol "%file2" requires explicit package name at absent.pl line 16.
Execution of absent.pl aborted due to compilation errors.

Answer 1

You're almost there.

The % sigil denotes a hash. You can't store a file name in a hash; for that you need a scalar.

my $file2 = 'file_all.txt';
my $file1 = 'file2.txt';

You need a hash to count the occurrences.

my %count;

To open a file, specify its name. It's stored in the scalar, remember?

open my $FH, '<', $file2 or die "Can't open $file2: $!";

Then, process the file line by line:

while (my $line = <$FH>) {
    chomp $line;          # Remove the trailing newline, if present.
    ++$count{lc $line};   # Count the lowercased string.
}

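Then, open the second file and process it line by line, again using lc to get the lowercased string.

To remove apostrophes, use a substitution: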

$line =~ s/'//g;  # Replace ' by nothing globally (i.e. everywhere).
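
Putting the steps above together, a minimal sketch might look like the following. It assumes one word per line, and the helper names normalize and absent_words, as well as the in-memory filehandles that stand in for the two files, are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a word and strip apostrophes so that
# "Pie" and "PIE" (or "don't" and "dont") compare as equal.
sub normalize {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/'//g;
    return $word;
}

# Hypothetical helper: return the lines of $all_fh whose normalized form
# never occurs in $known_fh. Assumes one word per line.
sub absent_words {
    my ($all_fh, $known_fh) = @_;

    my %count;
    while (my $line = <$known_fh>) {
        chomp $line;
        ++$count{ normalize($line) };
    }

    my @absent;
    while (my $line = <$all_fh>) {
        chomp $line;
        push @absent, $line unless $count{ normalize($line) };
    }
    return @absent;
}

# In-memory filehandles stand in for file_all.txt and file2.txt.
my $all_text   = "Pie\nCake\ndon't\nTart\n";
my $known_text = "PIE\ndont\n";
open my $all_fh,   '<', \$all_text   or die "Can't open in-memory file: $!";
open my $known_fh, '<', \$known_text or die "Can't open in-memory file: $!";

print "$_\n" for absent_words($all_fh, $known_fh);   # prints Cake, then Tart
```

To run it against real files, open the two handles on the file names instead of on string references.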

Answer 2

The problem is the following two lines:

 my %file2 = "file_all.txt";
 my %file1 = "file2.txt";

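Here you are assigning a single value (called a scalar in Perl) to a hash (denoted by the % sigil). A hash is made up of key-value pairs, usually written with the fat comma operator (=>). For example: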

my %hash = ( key => 'value' );

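A hash needs an even number of elements, because every key must come with a value. You are currently giving each hash a single value, so with warnings enabled Perl complains about an odd number of elements in the hash assignment.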

To assign a value to a scalar, use the $ sigil:

 my $file2 = "file_all.txt";
 my $file1 = "file2.txt";
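
As a small illustration of the difference, the sketch below declares one scalar and one hash and reads from both. The variable names and values are hypothetical and chosen only for the example.

```perl
use strict;
use warnings;

# A scalar holds exactly one value, such as a file name.
my $filename = 'file_all.txt';

# A hash holds key/value pairs, so its initializer list has an even length.
my %seen = ( pie => 1, cake => 2 );

# A single hash element is itself a scalar, hence the $ sigil when reading it.
print "file name: $filename\n";
print "pie seen $seen{pie} time(s)\n";
print "number of keys: ", scalar(keys %seen), "\n";
```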

Answer 3

Your error messages:

"my" variable $file2 masks earlier declaration in same scope at absent.pl line 9.
"my" variable $file1 masks earlier declaration in same scope at absent.pl line 14.
Global symbol "%file2" requires explicit package name at absent.pl line 11.
Global symbol "%file2" requires explicit package name at absent.pl line 16.
Execution of absent.pl aborted due to compilation errors.

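You assign a file name to $file2, and then you use open my $file2 ... The my $file2 in the second case masks the one declared in the first. Then, in the body of the while loop, you pretend there is a hash table %file2, but you never declared it.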
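
Perl keeps scalars, arrays, and hashes in separate namespaces, which is why declaring my $file2 does nothing for %file2. A minimal sketch, reusing one made-up name across all three purely to show that they are unrelated variables:

```perl
use strict;
use warnings;

my $name = 'words';            # scalar $name
my %name = ( words => 3 );     # hash %name, unrelated to $name
my @name = ( 'a', 'b' );       # array @name, unrelated to both

# $name{...} indexes the hash %name; $name[...] indexes the array @name.
print "scalar: $name\n";
print "hash element: $name{$name}\n";
print "array element: $name[1]\n";
```

Under use strict, each of the three has to be declared separately before it can be used.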

You should use more descriptive variable names to avoid conceptual confusion.

For example:

 my @filenames = qw(file_all.txt file2.txt);

Using variables with integer suffixes is a code smell.

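Then, factor common tasks out into subroutines. In this case, you need: 1) a function that takes a file name and returns a table of the words in that file, and 2) a function that takes a file name and a lookup table, and returns the words that are in the file but do not appear in the lookup table.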

#!/usr/bin/env perl

use strict;
use warnings;

use Carp qw( croak );

my @filenames = qw(file_all.txt file2.txt);

print "$_\n" for @{ words_notseen(
    $filenames[0],
    words_from_file($filenames[1])
)};

sub words_from_file {
    my $filename = shift;
    my %words;

    open my $fh, '<', $filename
        or croak "Cannot open '$filename': $!";

    while (my $line = <$fh>) {
        $words{ lc $_ } = 1 for split ' ', $line;
    }

    close $fh
        or croak "Failed to close '$filename': $!";

    return \%words;
}

sub words_notseen {
    my $filename = shift;
    my $lookup = shift;

    my %words;

    open my $fh, '<', $filename
        or croak "Cannot open '$filename': $!";

    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            # Lookup keys are lowercased, so lowercase the word before testing.
            unless (exists $lookup->{lc $word}) {
                $words{ $word } = 1;
            }
        }
    }

    return [ keys %words ];
}

Answer 4

As you mentioned in the question: it should print out any words in file_all that are not in file2.

The short code below does just that:

#!/usr/bin/perl
use strict;
use warnings;

my ($file1, $file2) = qw(file_all.txt file2.txt);

open my $fh1, '<', $file1 or die "Can't open $file1: $!";
open my $fh2, '<', $file2 or die "Can't open $file2: $!";

while (<$fh1>)
{
    last if eof($fh2);
    my $compline = <$fh2>;
    chomp($_, $compline);
    if ($_ ne $compline)
    {
        print "$_\n";
    }
}

file_all.txt:

ab
cd
ee
ef
gh
df

file2.txt:

zz
yy
ee
ef
pp
df

Output:

ab
cd
gh
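
Note that the loop above compares line N of one file with line N of the other, so it relies on the two files lining up. If order should not matter, a lookup hash can be used instead. The following is a minimal sketch on the same sample data, held in lists rather than read from disk (an assumption made so that it runs on its own):

```perl
use strict;
use warnings;

# Sample data from above, held in lists instead of files.
my @all   = qw(ab cd ee ef gh df);
my @known = qw(zz yy ee ef pp df);

# Build a lookup of the lowercased words from the second list.
my %seen;
$seen{ lc $_ } = 1 for @known;

# Keep the words of the first list that are not in the lookup.
my @absent = grep { !$seen{ lc $_ } } @all;

print "$_\n" for @absent;   # prints ab, cd, gh (one per line)
```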
