Rejecting info from being stored in a file

I have a working program that searches Google using Mechanize, but when the program searches Google it also pulls in sites like http://webcache.googleusercontent.com/.

I want to reject that site from being stored in the file. Every site's URL structure is different.

Source code:

require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&}) 
      urls_to_log = str_list[1]
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls

Text file:

http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk

It works now. I had a problem with success('log'); I don't know why, so I commented it out.

  str_list = str.split(%r{=|&})
  urls_to_log = str_list[1]
  next if urls_to_log.split('/')[2] == "webcache.googleusercontent.com"
  # success("Site found: #{urls_to_log}")
  File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
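Stated as a self-contained sketch, that update amounts to the following. The hrefs here are hypothetical stand-ins for what Google's result links look like; only the structure matters:

```ruby
# Hypothetical hrefs in the shape Google's result links take (assumption
# for illustration, not real scraped data).
hrefs = [
  '/url?q=http://www.speedtest.net/&sa=U',
  '/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:xyz&sa=U'
]

# Pull the value after the first "=" and drop any URL whose host segment
# is webcache.googleusercontent.com.
urls = hrefs.map { |str| str.split(%r{=|&})[1] }
            .reject { |u| u.split('/')[2] == 'webcache.googleusercontent.com' }

puts urls
```

This keeps the asker's string-splitting approach intact; the answer below argues for parsing the URL properly instead.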

There are well-tested wheels for tearing URLs into their parts, so use them. Ruby comes with URI, which makes it easy to extract the host, path, and query:

require 'uri'

URL = 'http://foo.com/a/b/c?d=1'

URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
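URI can also hand the query string to CGI.parse, which is a more robust way to pull the target out of a `/url?q=...` link than splitting on `=` and `&`. The href below is a hypothetical example of that link shape:

```ruby
require 'uri'
require 'cgi'

# Hypothetical Google result link (assumption for illustration).
href = 'http://www.google.com/url?q=http://www.speedtest.net/&sa=U'

# CGI.parse turns the query string into a hash of parameter => values.
params = CGI.parse(URI.parse(href).query)
target = params['q'].first
# => "http://www.speedtest.net/"
```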

Ruby's Enumerable module includes reject and select, which make it easy to iterate over an array or enumerable object, rejecting or selecting elements as you go:

(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]

Using all of that, you can check a URL's host for a substring and reject any you don't want:

require 'uri'

%w[
  http://www.speedtest.net/
  http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]

Using those methods and techniques, you can reject or select from an input file, or simply look at individual URLs and choose to ignore or honor them.
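Put together, a sketch that filters an already-written sites file might look like this. The in-memory `lines` array stands in for the file contents, and the file names are assumptions:

```ruby
require 'uri'

# Hypothetical input; in practice you would read it with
# File.readlines("#{PATH}/temp/sites.txt").
lines = [
  'http://www.speedtest.net/',
  'http://webcache.googleusercontent.com/search%3Fhl%3Den'
]

# Keep only URLs whose host is not under googleusercontent.com.
kept = lines.map(&:strip).reject do |url|
  URI.parse(url).host.to_s.end_with?('googleusercontent.com')
end

# Then write the survivors back out, e.g.:
# File.write("#{PATH}/temp/sites.txt", kept.join("\n"))
```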