Amend perl script so that words are matched on a word for word basis

Question

I've been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files. The original question is here:

https://unix.stackexchange.com/questions/283942/matching-3-or-more-words-from-fields-in-separate-csv-files?noredirect=1#comment494461_283942

I've also added some exception functionality, following advice from meuh:

#!/bin/perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my @titlewords = split /\s+/, $title;    #/ get words

  my @new;                          # drop exception words which shouldn't be matched
  foreach my $t (@titlewords) {
    push(@new, $t) if $t !~ /^(and|if|where)$/i;
  }
  @titlewords = @new;
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ++$count if $value =~ /\b$word\b/i;
      last if $count >= $desired;
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

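During my testing, I found one issue that I'd like to tweak: if csv2 contains a single common word such as the, and that word is repeated three or more times in csv1, then three positive matches are found. To clarify:

If csv1 contains: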

1216454,the important people feel the same way as the others, 15445454, 45445645

^ i.e. the line above contains 3 instances of the

If csv2 contains:

14564564,the tallest man on earth,546456,47878787

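^ i.e. this line contains one instance of the

Then I'd like only one word to be classed as matching, and no output to be produced (based on my desired number of matching words, 3), since one of the files contains only one instance of the matching word.

But if:

csv1 contains: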

1216454,the important people feel the same way as the others,15445454, 45445645

csv2 contains:

15456456,the only way the man can sing the blues,444545,454545

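Then, since there are three matching words in each title (i.e. 3 instances of the word the in each title), I'd like this to be classed as a matching title, based on my desired number of matching words being 3 or more, thus generating the output: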

1216454,the important people feel the same way as the others,15445454, 45445645
15456456,the only way the man can sing the blues,444545,454545

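I'd like to amend the script so that if there is one instance of a word in one csv and multiple instances of the same word in the other csv, it is classed as only one match. However, if there are 3 instances of the word the in both files, it should still be classed as three matches. Basically, I'd like matches to be on a word for word basis. Everything about the script other than this is perfect, so I'd rather not go completely back to the drawing board. I hope I've explained this clearly; if anyone needs any clarification, let me know.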

Answer 1

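If you only want to count unique matches, you can use a hash instead of a list to collect the words from csv1, just as you do for csv2, and then also count how many times each word occurs in each title: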

#!/usr/bin/env perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my %words;
  $words{$_}++ for split /\s+/, $title;    #/ get words
  ## Collect unique words
  my @titlewords = keys(%words);
  my @new;                          # drop exception words which shouldn't be matched
  foreach my $t (@titlewords) {
    push(@new, $t) if $t !~ /^(and|if|where)$/i;
  }
  @titlewords = @new;
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ## Count occurrences of this word in each title; \Q...\E keeps
      ## regex metacharacters in a word from being interpreted
      my @matches   = ( $value =~ /\b\Q$word\E\b/ig );
      my $numIncsv2 = scalar(@matches);
      @matches      = ( $title =~ /\b\Q$word\E\b/ig );
      my $numIncsv1 = scalar(@matches);
      ++$count if $value =~ /\b\Q$word\E\b/i;
      ## Enough distinct words matched, or one word repeated enough in both titles
      if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
        $count = $desired+1;
        last;
      }
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;
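
Note that the script above only special-cases a single word that occurs $desired or more times in both titles. If you want every repeated word to count as many times as it occurs in both titles, one possible approach is to count the occurrences of each word per title and add up the smaller of the two counts. The following is a minimal sketch of that idea, not a drop-in replacement. The helper names word_counts and word_for_word_matches are made up for illustration. It also splits titles on whitespace instead of using the \b regex from the script, so punctuation attached to a word is treated as part of that word.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): count how often each
# word occurs in a title, case-insensitively, skipping the exception words.
sub word_counts {
    my ($title) = @_;
    my %count;
    for my $word (split ' ', lc $title) {
        next if $word =~ /^(?:and|if|where)$/;
        $count{$word}++;
    }
    return \%count;
}

# Hypothetical helper: number of word-for-word matches between two titles.
# Each shared word contributes the smaller of its two occurrence counts.
sub word_for_word_matches {
    my ($title1, $title2) = @_;
    my ($c1, $c2) = (word_counts($title1), word_counts($title2));
    my $total = 0;
    for my $word (keys %$c1) {
        next unless exists $c2->{$word};
        $total += $c1->{$word} < $c2->{$word} ? $c1->{$word} : $c2->{$word};
    }
    return $total;
}

# Titles taken from the example lines in the question
my $csv1_title = 'the important people feel the same way as the others';
print word_for_word_matches($csv1_title, 'the tallest man on earth'), "\n";
print word_for_word_matches($csv1_title, 'the only way the man can sing the blues'), "\n";
```

For the example titles in the question this prints 1 and then 4 (three for the, one for way), so only the second pair reaches a threshold of 3. In the script, the result could be compared against $desired in place of $count.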

perl

text-processing