Text Mining Cleanup with Ruby & Regex
I have a word-count hash that looks like this:
words = {
  "love" => 10,
  "hate" => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved" => 1,
  "peace" => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}
I want to make sure "love", "loved", and "loving" are all counted as "love". So I add their counts together as the count for "love" and delete the remaining variations of "love". At the same time, though, I don't want "lovely" to be counted as "love", so I keep it as it is.
So I would end up with something like this:
words = {
  "love" => 13,
  "hate" => 12,
  "lovely" => 3,
  "peace" => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}
I have some code that works, but I think the logic of the last line is really wrong. I was wondering if you could help me fix it or suggest a better approach.
words.select { |k| /\Alov[a-z]*/.match(k) }
words["love"] = purgedWordCount.select { |k| /\Alov[a-z]*/.match(k) }.map(&:last).reduce(:+) - 1
# that 1 is for "lovely"; I tried not to hard-code it by using words["lovely"],
# but it messed things up completely, so I had to do this.
words.delete_if { |k| /\Alov[a-z]*/.match(k) && k != "love" && k != "lovely" }
Thanks!
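One minimal fix for the asker's last two lines (assuming the `words` hash shown above) is to exclude "lovely" in the filter itself, which removes the hard-coded `- 1` entirely; a sketch:

```ruby
words = {
  "love" => 10, "hate" => 12, "lovely" => 3, "loving" => 2,
  "loved" => 1, "peace" => 14, "thanks" => 3,
  "wonderful" => 10, "grateful" => 10
}

# Select every "lov..." variant except "lovely", sum their counts,
# then drop the variants other than "love" and "lovely" themselves.
variants = words.select { |k, _| k =~ /\Alov/ && k != "lovely" }
words["love"] = variants.values.sum
words.delete_if { |k, _| k =~ /\Alov/ && k != "love" && k != "lovely" }
# words => {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14,
#           "thanks"=>3, "wonderful"=>10, "grateful"=>10}
```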
Here is a functionally equivalent version:
words = {
  "love" => 10,
  "hate" => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved" => 1,
  "peace" => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
}
to_love_or_not_to_love = words.partition {|w| w.first =~ /^lov/ && w.first != "lovely"}
{"love" => to_love_or_not_to_love.first.map(&:last).sum}.merge(to_love_or_not_to_love.last.reduce({}) {|m, e| m[e.first] = e.last; m})
=> {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3, "wonderful"=>10, "grateful"=>10}
words = {
  "love" => 10,
  "hate" => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved" => 1,
  "peace" => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}
aggregated_words = words.inject({}) do |memo, (word, count)|
  key = word =~ /\Alov.+/ && word != "lovely" ? "love" : word
  memo[key] = memo[key].to_i + count
  memo
end
> {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3, "wonderful"=>10, "grateful"=>10}
I think that if you're going to process a large enough vocabulary, what you really need is a stemmer, not just a regular expression. Hashing on the stems would be a simple and elegant solution.
There is a simple one for English here, but there are many gems for this purpose and for different languages.
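To illustrate the "hash on the stem" idea, here is a sketch where `stem_of` is a deliberately crude, made-up suffix stripper standing in for a real stemming gem:

```ruby
# Crude stand-in for a real stemmer: strips a few common suffixes.
# A real project would use a stemming gem instead.
def stem_of(word)
  word.sub(/(ing|ed|e)\z/, "")
end

words = { "love" => 10, "hate" => 12, "lovely" => 3, "loving" => 2, "loved" => 1 }

# Aggregate the counts under each word's stem.
by_stem = words.each_with_object(Hash.new(0)) do |(word, count), h|
  h[stem_of(word)] += count
end
# by_stem => {"lov"=>13, "hat"=>12, "lovely"=>3}
```

Note that a real English stemmer may fold "lovely" into the same stem as "love", so the asker's "lovely" exception would still need explicit handling.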
I suggest the following:
r = /
  lov     # match 'lov'
  (?!ely) # negative lookahead to not match 'ely'
  [a-z]+  # match one or more letters
          # /x is for 'extended', /i makes it case-independent
/xi
words.each_with_object(Hash.new(0)) { |(k,v),h| (k=~r) ? h["love"]+=v : h[k]=v }
#=> {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3,
# "wonderful"=>10, "grateful"=>10}