How do I filter my results when scraping a website using the Nokogiri gem?

I am trying to scrape a list of restaurants for my postcode from Deliveroo.co.uk.

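I need to add a way to determine whether each restaurant is open or closed. It is clear from looking at the website, but I need to update my code to reflect it.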

How can I do this? I need to create a variable like 'status' and then set it to 'open' or 'closed' for each restaurant.

This is the site I am scraping from: https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE&time=1800&day=today

My code is below.

Thanks.

    require 'open-uri'
    require 'nokogiri'
    require 'csv'

    # Store URL to be scraped
    url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"

    # Parse the page with Nokogiri
    # (URI.open is used because a bare open(url) no longer accepts URLs on Ruby 3+)
    page = Nokogiri::HTML(URI.open(url))

    # Collect each field into its own array
    name = []
    page.css('span.list-item-title.restaurant-name').each do |line|
      name << line.text
    end

    category = []
    page.css('span.restaurant-detail.detail-cat').each do |line|
      category << line.text
    end

    delivery_time = []
    page.css('span.restaurant-detail.detail-time').each do |line|
      delivery_time << line.text
    end

    distance = []
    page.css('span.restaurant-detail.detail-distance').each do |line|
      distance << line.text
    end

    # Not filled in yet: this should hold 'open' or 'closed' per restaurant
    status = []

    # Write data to CSV file
    CSV.open("deliveroo.csv", "w") do |file|
      file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
      name.length.times do |i|
        file << [name[i], category[i], delivery_time[i], distance[i]]
      end
    end

We need to check whether each li.restaurant--details element has the class unavailable: if it does, the restaurant is closed; otherwise it is open.

    status = []
    page.css('li.restaurant--details').each do |line|
      # A restaurant is closed when its list item carries the "unavailable" class
      status << (line.attr("class").include?("unavailable") ? "closed" : "open")
    end

By the way, you should strip the whitespace when you read the restaurant name, and the same goes for the other fields:

    page.css('span.list-item-title.restaurant-name').each do |line|
      name << line.text.strip
    end
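For the status to show up in the CSV, each row also has to include it, since the header already has a Status column. A minimal sketch, using made-up sample values in place of the scraped arrays and CSV.generate so the result can be printed; CSV.open("deliveroo.csv", "w") takes rows the same way:

```ruby
require 'csv'

# Sample values standing in for the arrays built by the scraper (made up for illustration).
name          = [' Pizza Place ', "Curry House\n"]
category      = ['Italian', 'Indian']
delivery_time = ['30 mins', '45 mins']
distance      = ['0.5 miles', '1.2 miles']
status        = ['open', 'closed']

# Build the CSV as a string instead of writing a file.
csv_text = CSV.generate do |csv|
  csv << ['Name', 'Category', 'Delivery Time', 'Distance', 'Status']
  # zip pairs up the i-th element of every array into one row.
  name.map(&:strip).zip(category, delivery_time, distance, status).each do |row|
    csv << row
  end
end

puts csv_text
```

Note that zip only lines the rows up correctly if every array has one entry per restaurant, in the same order.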

You can refer to my code here: https://gist.github.com/vinhnglx/4eaeb2e8511dd1454f42