In an array of hashes, how do I count the 'webpages' with the most unique 'page' views?

I have a text file that records which IP addresses visited which pages, for example:

/help_page/1 126.318.035.038
/contact 184.123.665.067
/home 184.123.665.067
/about/2 444.701.448.104
/help_page/1 929.398.951.889
/index 444.701.448.104
/help_page/1 722.247.931.582
/about 061.945.150.735
/help_page/1 646.865.545.408
/home 235.313.352.950

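I need to parse this log file and print a list of pages ordered from most page views to fewest. I have managed to get the correct result for that part.

The second task is to print a list of webpages with their unique page views, and this is where I ran into problems.

Below is the code that prints total page views, sorted from highest to lowest: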

require 'open-uri'

log_read = File.read('webserver.log')

split_log = log_read.split("\n/") # split_log = array

split_log[0] = split_log[0].sub('/', '')

split_array = split_log.map { |line| line.split(' ') }

# Most views
container = Hash.new(0) # empty

split_array.each do |item|
  container[item[0]] += 1
end

sorted_container = container.sort_by { |_k, v| v }.reverse

# Number of page visits
sorted_container.each do |k, v|
  puts "#{k} has #{v} visits"
end

The result of the above code is:
about/2 has 90 visits
contact has 89 visits
index has 82 visits
about has 81 visits
help_page/1 has 80 visits
home has 78 visits

Now for the second part: I am asked to show a list of webpages with their unique page views. My idea was to map 'split_array' like this:

sorted_unique_views = split_array.map { |h| h.to_a }.uniq.map { |k, v| { k => v } }

which gives me an array of hashes:
[
{"help_page/1"=>"126.318.035.038"}
{"contact"=>"184.123.665.067"}
{"home"=>"184.123.665.067"}
{"about/2"=>"444.701.448.104"}
{"help_page/1"=>"929.398.951.889"}
{"index"=>"444.701.448.104"}
{"help_page/1"=>"722.247.931.582"}
{"about"=>"061.945.150.735"}
{"help_page/1"=>"646.865.545.408"}
{"home"=>"235.313.352.950"}
{"help_page/1"=>"543.910.244.929"}
....etc ]

What I really want is to somehow iterate over sorted_unique_views=[{...},{...},etc] and count the unique IPs for each page, so that the final result looks like this:

help_page/1 23
contact 23
home 22
about/2 22
index 23
about 22

I tried inject, iterating over sorted_unique_views=[{...},{...},etc], but I either get 135, which is the total of all unique page views, or I get

{{"help_page/1"=>"126.318.035.038"}=>1} 

If possible, I would like some guidance, and feedback on whether splitting and then mapping is the right approach here.

Many thanks

Create a test file

Let's first create a file.¹

text =<<-END
/help_page/1 126.318.035.038
/contact 184.123.665.067
/home 184.123.665.067
/about/2 444.701.448.104
/help_page/1 929.398.951.889
/index 444.701.448.104
/help_page/1 722.247.931.582
/about 061.945.150.735
/help_page/1 646.865.545.408
/home 235.313.352.950
END

FNAME = 'log'
File.write(FNAME, text)
  #=> 256

Confirm the contents.

puts File.read(FNAME)
/help_page/1 126.318.035.038
/contact 184.123.665.067
/home 184.123.665.067
...
/home 235.313.352.950

Read the file and construct a useful hash

h = File.foreach(FNAME).with_object(Hash.new { |h,k| h[k] = [] }) do |line,h|
  key, url = line[1..-2].split
  h[key] << url
end
  #=> {"help_page/1"=>["126.318.035.038", "929.398.951.889", "722.247.931.582",
  #                    "646.865.545.408"],
  #    "contact"    =>["184.123.665.067"],
  #    "home"       =>["184.123.665.067", "235.313.352.950"],
  #    "about/2"    =>["444.701.448.104"],
  #    "index"      =>["444.701.448.104"],
  #    "about"      =>["061.945.150.735"]} 

Use this hash to compute objects of interest

Determine the number of views for each key

h.transform_values(&:count)
  #=> {"help_page/1"=>4, "contact"=>1, "home"=>2, "about/2"=>1, "index"=>1, "about"=>1} 
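If only the counts are needed, the intermediate hash of arrays can be skipped. The following is a minimal sketch, assuming Ruby 2.7 or later (for Enumerable#tally); SAMPLE_LINES and view_counts are hypothetical names introduced for illustration, and the sample lines are copied from the log excerpt in the question.

```ruby
# Sample lines copied from the question's log excerpt.
SAMPLE_LINES = [
  '/help_page/1 126.318.035.038',
  '/contact 184.123.665.067',
  '/home 184.123.665.067',
  '/help_page/1 929.398.951.889',
  '/home 235.313.352.950'
].freeze

# Hypothetical helper: maps each page name to its number of views.
def view_counts(lines)
  # Take the first field of each line, drop the leading "/", then count occurrences.
  lines.map { |line| line.split.first.delete_prefix('/') }.tally
end

p view_counts(SAMPLE_LINES)
  #=> {"help_page/1"=>2, "contact"=>1, "home"=>2}
```

This counts every visit, repeated or not, so it matches the "total views" task rather than the "unique views" task.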

Create a list in decreasing order of page views

h.sort_by { |_,a| -a.size }
  #=> [["help_page/1", ["126.318.035.038", "929.398.951.889", "722.247.931.582",
  #                     "646.865.545.408"]],
  #    ["home",    ["184.123.665.067", "235.313.352.950"]],
  #    ["contact", ["184.123.665.067"]],
  #    ["about/2", ["444.701.448.104"]],
  #    ["index",   ["444.701.448.104"]],
  #    ["about",   ["061.945.150.735"]]] 

Or, depending on requirements:

h.sort_by { |_,a| -a.size }.to_h
  #=> {"help_page/1"=>["126.318.035.038", "929.398.951.889", "722.247.931.582",
  #                    "646.865.545.408"],
  #    "home"       =>["184.123.665.067", "235.313.352.950"],
  #    "contact"    =>["184.123.665.067"],
  #    "about/2"    =>["444.701.448.104"],
  #    "index"      =>["444.701.448.104"],
  #    "about"      =>["061.945.150.735"]} 

Determine which keys were viewed only once

h.select { |_,a| a.size == 1 }
  #=> {"contact"=>["184.123.665.067"],
  #    "about/2"=>["444.701.448.104"],
  #    "index"=>["444.701.448.104"],
  #    "about"=>["061.945.150.735"]}
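The selection above finds pages that were viewed exactly once, which may not be what the question means by "unique page views". If that phrase is read as the number of distinct IP addresses per page (an assumption about the intent), the core of it would be h.transform_values { |a| a.uniq.size }. A self-contained sketch under that assumption follows; LOG_TEXT and unique_views are hypothetical names, and one repeated visit has been added to the sample data so that deduplication is visible.

```ruby
# Sample data based on the question, with one repeated visit
# (the second "/help_page/1 126.318.035.038" line is an added assumption).
LOG_TEXT = <<~LOG
  /help_page/1 126.318.035.038
  /contact 184.123.665.067
  /home 184.123.665.067
  /about/2 444.701.448.104
  /help_page/1 929.398.951.889
  /index 444.701.448.104
  /help_page/1 126.318.035.038
  /home 235.313.352.950
LOG

# Hypothetical helper: returns [[page, distinct_ip_count], ...],
# highest count first, ties broken alphabetically by page name.
def unique_views(log_text)
  visits = Hash.new { |hash, page| hash[page] = [] }
  log_text.each_line do |line|
    page, ip = line.split
    next if ip.nil? # skip blank or malformed lines
    visits[page.delete_prefix('/')] << ip
  end
  visits.transform_values { |ips| ips.uniq.size }
        .sort_by { |page, count| [-count, page] }
end

unique_views(LOG_TEXT).each { |page, count| puts "#{page} #{count}" }
# help_page/1 2
# home 2
# about/2 1
# contact 1
# index 1
```

The page name is included in the sort key because sort_by is not guaranteed to be stable, so ties would otherwise come out in an unspecified order.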

Notes

See IO::write, IO::read, IO::foreach, Enumerator#with_object, Hash::new, Hash#transform_values, Enumerable#count and Enumerable#sort_by.²

The computation of h could also be written as follows.

h = {}
File.foreach(FNAME) do |line|
  key, url = line[1..-2].split
  h[key] = [] unless h.key?(key)
  h[key] << url
end
h

This shows what with_object and Hash.new { |h,k| h[k] = [] } accomplish. line[1..-2] removes the first character of the line (/) and the newline at the end of the line ("\n").
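To make the role of the block form of Hash.new more concrete, here is a small sketch contrasting it with a plain default object, followed by a look at the [1..-2] slice. The method names block_default and shared_default are hypothetical, introduced only for this illustration.

```ruby
# Block form: each missing key gets its own new array, which is stored in the hash.
def block_default
  visits = Hash.new { |hash, key| hash[key] = [] }
  visits['home'] << '184.123.665.067'
  visits['about'] << '061.945.150.735'
  visits
end

# Plain default object: every missing key returns the same array,
# and nothing is ever stored under the key.
def shared_default
  visits = Hash.new([])
  visits['home'] << '184.123.665.067'
  visits['about'] << '061.945.150.735'
  visits
end

p block_default           #=> {"home"=>["184.123.665.067"], "about"=>["061.945.150.735"]}
p shared_default          #=> {}
p shared_default['index'] #=> ["184.123.665.067", "061.945.150.735"]

# The slice drops the first and last characters regardless of what they are.
p "/home 235.313.352.950\n"[1..-2] #=> "home 235.313.352.950"
p "/home 235.313.352.950"[1..-2]   #=> "home 235.313.352.95"
```

The last line suggests that the slice relies on every line ending in a newline. If the final line of a file might lack one, line.chomp.delete_prefix('/') would be one possible alternative.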

h.transform_values(&:count)

is shorthand for:

h.transform_values { |v| v.count }

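1. For formatting reasons I have indented each line of the heredoc by 4 spaces. To run the code, first un-indent the heredoc's lines.

2. Class and module methods are denoted by a double colon between the class or module and the method name (e.g., IO::write); instance methods are denoted by a pound sign between the class or module and the instance method (e.g., Enumerator#with_object). IO methods are commonly invoked on the class File (e.g., File.foreach ... rather than IO.foreach ...). That is permissible because File is a subclass of IO and therefore inherits IO's class and instance methods.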
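The inheritance relationship described in the second note can be checked directly in irb. This is only a quick illustration using standard reflection methods:

```ruby
p File.superclass                        #=> IO
p File.ancestors.include?(IO)            #=> true

# File.foreach is not defined on File itself; it is found on IO's singleton class.
p File.method(:foreach).owner            #=> #<Class:IO>

# Instance methods such as each_line are likewise inherited from IO.
p File.instance_method(:each_line).owner #=> IO
```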