Filtering duplicate substrings from a hash in Ruby

Question

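I am writing a Rails application that fetches RSS feeds from news pages, applies part-of-speech tagging to the headlines, and extracts the noun phrases from the headlines together with the number of times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I am using this code to do it: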

filtered_noun_phrases = sorted_noun_phrases.select{|a|
  sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h

So this:

{"troops retake main government office"=>2,
 "retake main government office"=>2, "main government office"=>2}

should become:

{"troops retake main government office"=>2}

However, a sorted hash of noun phrases like this:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
 "boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
 "silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
 "george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
 "iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
 "haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}

is only partially filtered:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2,
 "boko haram teen"=>2}

So how do I filter duplicate substrings out of a hash in a way that actually works?

Answer 1

# Drop every phrase that occurs inside some other phrase in the hash.
filtered_noun_phrases = sorted_noun_phrases.reject { |a|
  sorted_noun_phrases.keys.any? { |b| b != a && b.include?(a) } }

- trueunlessfalse

Answer 2

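What you are currently doing is selecting every phrase that has some other phrase as a substring. That is the opposite of the test you want: it keeps the containers, not the phrases that nothing else contains.

For "troops retake main government office" the result happens to be right, because it contains "retake main government office" and is therefore kept.

But "retake main government office" in turn contains "main government office", so it is kept as well instead of being filtered out.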
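To make that concrete, here is a minimal sketch that runs the question's test on the three-phrase example. The variable names (`phrases`, `kept_by_select`) are just illustrative, and I spelled out the block arguments and used `include?` in place of `index`; the logic is otherwise the same as in the question.

```ruby
# The three-phrase example from the question.
phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2
}

# The question's test: keep a phrase if it CONTAINS some other phrase.
kept_by_select = phrases.select { |phrase, _count|
  phrases.keys.any? { |other| other != phrase && phrase.include?(other) }
}

# Both longer phrases survive, because each contains a shorter one.
# The shortest phrase is dropped, because it contains nothing else.
puts kept_by_select.keys
```

So with this test a phrase only disappears when it has no shorter phrase inside it, which is why the big hash comes out only partially filtered.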

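What you can do instead, for example:

 filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h

This rejects every phrase for which some other phrase in the hash contains it, so only the phrases that are not part of any other phrase remain.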
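If you want to try the reject approach in isolation, here is a small self-contained sketch. The helper name `filter_contained_phrases` and the reduced `sample` hash are my own for illustration (the phrases are taken from the question); this is not code from your app.

```ruby
# Hypothetical helper: returns a new hash without any phrase that occurs
# inside another, different phrase of the same hash.
def filter_contained_phrases(phrases)
  keys = phrases.keys
  phrases.reject do |phrase, _count|
    keys.any? { |other| other != phrase && other.include?(phrase) }
  end
end

# A reduced sample built from phrases in the question.
sample = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "toddler killer shot dead" => 2,
  "killer shot dead" => 2,
  "shot dead" => 2,
  "silver surfers" => 2
}

filtered = filter_contained_phrases(sample)
p filtered
# Remaining keys: "troops retake main government office",
# "toddler killer shot dead", "silver surfers"
```

Note that this is plain substring matching, so it compares characters rather than whole words, and it checks every phrase against every other phrase. That should be fine for a few dozen headlines, but it is something to keep in mind for larger inputs.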
