How to declare and assign a hash with count of occurrences in one statement

Question

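I'm working on an NLP project for automatic text classification. This is the code snippet that builds a primitive bag of words from an XML file. My question is about the following sequence: is it possible to put these 3 lines into a single line? I really don't like the empty declaration of %lemma_words: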

my %lemma_words = ();
foreach my $word (lemmatizer(join(" ", keys %bag_of_words))){
    $lemma_words{$word}++;
}

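lemmatizer(join(" ", keys %bag_of_words)) is a Python function called through Inline::Python, and it returns an array. I would like to know whether lemma_words can be declared in the same line as a hash with the lemmatized tokens as keys and their occurrence counts as values (using map, nmap ... I don't know). Here is the whole snippet. The only guideline for this university assignment is that the script must be written in Perl (the less Python code, the better).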

#!/usr/bin/perl

use strict;
#use warnings;
use open qw/ :std :encoding(UTF-8)/;

use Inline Python => <<'END_OF_PYTHON';

import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
nlp = spacy.load('fr_core_news_md')

def lemmatizer(words):
    doc = nlp(words)
    return list(filter(lambda x: x not in list(fr_stop), list(map(lambda token: token.lemma_ , doc))))

END_OF_PYTHON

open my $fh, "<:encoding(utf-8)", "sortiexml-slurp_3208.xml" or die "$!";
# first level bag_of_words. We remove only punct and num
my %bag_of_words = ();
while (my $ligne = <$fh>){
    next if $ligne !~ /^<element>/;
    # let's extract form of the token (word) from an xml file
    if ($ligne =~ /<element><data type="type".+>(.+)<.+<\/element>/){
        $bag_of_words{$1}++ if length($1) > 2 and $1 !~ /\d+/;
    }
}

close $fh;

# now it's time to lemmatize the words and count the occurrences of each lemmatized token, which gives a primitive bag of words
# to apply it after with Cosine Similarity or Naive Bayes for automatic text classification (example)
# we also remove stop words
my %lemma_words = ();
foreach my $word (lemmatizer(join(" ", keys %bag_of_words))){
    $lemma_words{$word}++;
}

# here it's simply a debug print to check errors
for my $key (keys %lemma_words) {
    print "$key => $lemma_words{$key}\n";
}

Answer 1

You can roll it into a subroutine ... or use the ones already provided in libraries.
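A minimal sketch of the subroutine option, using only core Perl. The name count_occurrences and the sample word list are assumptions made for illustration; the list stands in for whatever lemmatizer(...) returns.

```perl
use strict;
use warnings;

# Hypothetical helper: counts each element of its argument list and
# returns the hash as a flat key/count list, so the caller can assign
# it straight into a freshly declared hash.
sub count_occurrences {
    my %count;
    $count{$_}++ for @_;
    return %count;
}

# Stand-in for the list returned by lemmatizer(...)
my @lemmas = qw(chat chien chat oiseau chat chien);

# Declaration and assignment in one statement
my %lemma_words = count_occurrences(@lemmas);

print "$_ => $lemma_words{$_}\n" for sort keys %lemma_words;
```

With the helper defined once, the three lines from the question would collapse to my %lemma_words = count_occurrences(lemmatizer(join(" ", keys %bag_of_words)));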

For a simple and quick frequency counter of elements there is List::MoreUtils::frequency

use List::MoreUtils qw(frequency);

my %freq_count = frequency LIST;

LIST is any dynamically generated list (here, lemmatizer(...)), or an array variable.

If you want to fine-tune what gets counted, there is count_by in List::UtilsBy

use List::UtilsBy qw(count_by);

my %freq_count = count_by { $_ } LIST;

Again, LIST is any dynamically generated list or an array variable; in your case that is lemmatizer(...), which returns a list.

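It returns frequency counts for the values computed by the code inside the block; each element is made available in $_ in turn. So with $_ alone as the block, the count is for the elements themselves.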
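To illustrate what fine-tuning means here, the sketch below counts tokens case-insensitively by returning lc $_ from the block. The sample token list is an assumption standing in for the output of lemmatizer(...), and List::UtilsBy is assumed to be installed.

```perl
use strict;
use warnings;
use List::UtilsBy qw(count_by);

# Stand-in for the list returned by lemmatizer(...)
my @tokens = qw(Chat chat CHIEN chien chat);

# The value returned by the block is the key that gets counted,
# so variants that differ only in case are merged into one entry.
my %freq_count = count_by { lc $_ } @tokens;

print "$_ => $freq_count{$_}\n" for sort keys %freq_count;
```

Replacing the block body with $_ would count the tokens exactly as they appear, giving five distinct keys instead of two for this sample.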

Both List::MoreUtils and List::UtilsBy may need to be installed from CPAN.


perl