Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?

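This has been asked before, but it was never answered.

I want to search through an array of strings and find the phrases (2 or more words) that occur most frequently across those strings, so given: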

["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

I would want to get back something like:

{"from London" => 3, "my name is" => 2 }

I really have no idea how to approach this. Any suggestions would be great, even if it's just a strategy that I could test.

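This isn't a single-stage process, but it is possible. Ruby knows what a character is, what a number is, what a string is, and so on, but it doesn't know what a phrase is.

You'd need to:

  1. Begin by either building a list of phrases or finding one online. This will form the basis of the matching process.

  2. Iterate through the list of phrases for each string, to see whether any phrase from the list occurs within that string.

  3. Record a count of each instance of a phrase within a string.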
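The three steps above can be sketched roughly as follows. This is only a minimal illustration: the `phrases` list is a made-up placeholder (a real one would come from your own data or an external source), and matching is done case-insensitively by downcasing, which is an assumption rather than something the question requires.

```ruby
# Hypothetical phrase list (step 1); replace with a real list.
phrases = ["from london", "my name is", "really good"]

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

counts = Hash.new(0)

# Step 2: check every phrase against every string.
strings.each do |string|
  normalized = string.downcase
  phrases.each do |phrase|
    # \b anchors the match to word boundaries, so "is he" would not
    # match inside "this here".
    hits = normalized.scan(/\b#{Regexp.escape(phrase)}\b/).size
    # Step 3: record how many times the phrase occurred.
    counts[phrase] += hits if hits > 0
  end
end

p counts #=> {"from london"=>3, "my name is"=>2, "really good"=>1}
```

Note that this only finds phrases you already know about, which is the main limitation of a list-driven approach.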

Although it might not look like it, this is quite an advanced question, so try to break the task down into smaller ones.

The following might get you started. It is brute force and will be very, very slow for large data sets.

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

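# Build every phrase of 2 or more consecutive words from each line, lowercased.
phrases = x.flat_map do |line|
  words = line.downcase.scan(/\w+/)
  (2..words.size).flat_map { |n| words.each_cons(n).map { |gram| gram.join(' ') } }
end

# Count the phrases and keep only those that occur more than once.
counts = phrases.group_by { |phrase| phrase }
                .map { |phrase, hits| [phrase, hits.size] }
                .select { |_phrase, count| count > 1 }
                .to_h

# Drop any phrase that is contained in a longer repeated phrase.
# Note: this is a plain substring test, not a whole-word test.
all_phrases = counts.keys
counts.delete_if { |phrase, _count| all_phrases.any? { |other| other != phrase && other.include?(phrase) } }

p counts #=> {"from london"=>3, "my name is"=>2}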
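If the brute-force cost becomes a problem, one option is to cap the phrase length, since the number of candidate phrases per line grows quadratically with the number of words. Below is a rough sketch of that idea, not a tuned implementation. The method name `frequent_phrases` and the `max_words` and `min_count` parameters are invented for illustration. It also compares phrases as arrays of words, so the "contained in a longer phrase" check works on whole words rather than raw substrings.

```ruby
# Hypothetical helper: count repeated phrases of 2..max_words words.
def frequent_phrases(strings, max_words: 4, min_count: 2)
  counts = Hash.new(0)

  strings.each do |string|
    words = string.downcase.scan(/\w+/)
    # Only generate phrases up to max_words long to limit the work per line.
    (2..[max_words, words.size].min).each do |n|
      words.each_cons(n) { |gram| counts[gram] += 1 }
    end
  end

  repeated = counts.select { |_gram, count| count >= min_count }

  # Drop a phrase when a longer repeated phrase contains it as a run of whole words.
  maximal = repeated.reject do |gram, _count|
    repeated.keys.any? do |other|
      other.size > gram.size && other.each_cons(gram.size).include?(gram)
    end
  end

  maximal.map { |gram, count| [gram.join(' '), count] }.to_h
end

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

p frequent_phrases(strings) #=> {"from london"=>3, "my name is"=>2}
```

As in the code above it, a shorter phrase is dropped whenever a longer repeated phrase contains it, even if the shorter one occurs more often. Whether that is what you want depends on your data.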

How about:

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

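# Split each line into words (keeping apostrophes) and pair up adjacent words.
# Pairing within each line avoids counting pairs that straddle two strings.
word_pairs = x.flat_map do |line|
  line.split(/[^\w']+/).reject(&:empty?).each_cons(2).map { |pair| pair.join(' ') }
end

# Tally how often each pair occurs.
counts = Hash.new(0)
word_pairs.each { |pair| counts[pair] += 1 }

pairs_occurring_twice_or_more = counts.select { |_pair, count| count > 1 }
p pairs_occurring_twice_or_more #=> {"my name"=>2, "name is"=>2, "from London"=>3}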