Filtering duplicate substrings from a hash in Ruby
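I'm writing a Rails application that fetches RSS feeds from news pages, applies part-of-speech tagging to the titles, and extracts the noun phrases from the titles along with the number of times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I'm using this code to do it: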
filtered_noun_phrases = sorted_noun_phrases.select{|a|
sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h
So this:
{"troops retake main government office"=>2,
"retake main government office"=>2, "main government office"=>2}
should become:
{"troops retake main government office"=>2}
However, the sorted hash of noun phrases looks like this:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
"boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
"silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
"george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
"iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
"haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}
and it is only partially filtered, giving this instead:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2,
"boko haram teen"=>2}
So how can I filter duplicate substrings out of a hash in a way that actually works?
- trueunlessfalse
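What you are currently doing is selecting every phrase for which some other phrase exists that is a substring of it.
For "troops retake main government office" the condition is true, because we find "retake main government office" inside it.
However, for "retake main government office" the condition is also true, because we find "main government office" inside it, so that phrase is kept rather than filtered out.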
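To make the behaviour above concrete, here is a minimal sketch that runs the question's predicate on the three-entry sample. The names `phrases`, `contains_other`, and `kept` are introduced only for illustration and are not part of the original code.

```ruby
# Three-entry sample taken from the question.
phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2
}

# The question's predicate: is some OTHER key a substring of `a`?
contains_other = lambda do |a|
  phrases.keys.any? { |b| b != a && a.include?(b) }
end

# select keeps every phrase that contains a shorter phrase,
# so only the shortest phrase is dropped.
kept = phrases.select { |a, _count| contains_other.call(a) }

puts kept.keys
```

The two longer phrases survive and only "main government office" is removed, which is the opposite of the intended result.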
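Instead, you can do, for example:
filtered_noun_phrases = sorted_noun_phrases.reject { |a, _count|
  # drop a when some other key b contains it as a substring
  sorted_noun_phrases.keys.any? { |b| b != a && b.include?(a) } }
This rejects every phrase for which some other phrase exists that contains it.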
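As a runnable sketch of the `reject` approach, the snippet below applies it to a shortened sample of the question's data. The seven-entry hash and the block variable names `phrase` and `other` are chosen here for illustration only.

```ruby
# Shortened sample of the question's sorted hash.
sorted_noun_phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "government office" => 2,
  "cia hacking tools" => 2,
  "hacking tools" => 2,
  "silver surfers" => 2
}

# Build the key list once instead of on every iteration.
keys = sorted_noun_phrases.keys

# Reject a phrase when any OTHER key contains it as a substring.
filtered_noun_phrases = sorted_noun_phrases.reject do |phrase, _count|
  keys.any? { |other| other != phrase && other.include?(phrase) }
end

puts filtered_noun_phrases.keys
```

Only the phrases that are not contained in any longer phrase remain. Note that `include?` compares raw characters rather than whole words, so a phrase such as "own wedding" would also be dropped if a key like "town wedding" (a hypothetical example) were present. If that matters, a word-boundary check would be needed instead.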