Unable to scrape data using Nokogiri in Ruby
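I am currently trying to scrape data from a web page using Nokogiri.
I want to scrape the list of service centers from the link http://www.cardekho.com/Maruti/Noida/car-service-center.htm
The code I have written for this is: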
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))
doc.css('.delrname').each do |node|
  puts node.text
end
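I have tried a bunch of CSS selector combinations, but none of them gave the desired result. Can anyone suggest a selector that correctly scrapes the service center list from this link?
Thanks in advance.
PS: The same code (with the appropriate CSS selectors) works as expected when I test it on other websites, but it does not work on this site.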
Your code seems to work. I removed the space in the URL:
# On Ruby 3.0 and later, call URI.open here; Kernel#open no longer opens URLs.
doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))
Then I ran it, and this is the output:
$ ruby file.rb
Fast Track Auto Care India
Jkm Motors
Mangalam Motors
Motorcraft India
Motorcraft India
Rohan Motors
Rohan Motors
Rohan Motors
Vipul Motors
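Since the fix came down to a stray space in the URL, it may help to normalize the string before handing it to open-uri. The following is a minimal sketch and not part of the original answer: the helper name `clean_url` is hypothetical, and the position of the space in the sample string is an assumption made purely for illustration.

```ruby
require 'uri'

# Hypothetical helper: remove all whitespace from a URL string and
# confirm that the result parses as a URI.
def clean_url(raw)
  cleaned = raw.gsub(/\s+/, '')
  URI.parse(cleaned) # raises URI::InvalidURIError if still malformed
  cleaned
end

# The location of the space in this sample is assumed, not known.
raw = "http://www.cardekho.com/Maruti/Noida/ car-service-center.htm"
puts clean_url(raw)
```

The cleaned string can then be passed to the same Nokogiri call as before.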
Optionally, you can use a regular expression to get more detailed results, for example:
/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/
You can then break the results down, for example:
# scan is a String method, so the pattern runs over the serialized HTML
# rather than the Nokogiri document. Assumption: the serialized markup
# still matches the pattern; scanning the raw response body is an alternative.
arr_matches = doc.to_html.scan(/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/)
arr_matches.each do |dealer_info|
  entire_match = dealer_info[0] # outermost group: the whole matched block
  name = dealer_info[1]
  address = dealer_info[2]
  mobile = dealer_info[3]
  link = dealer_info[4]
  email = dealer_info[5]
end
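To see how `scan` hands back the capture groups, here is a small self-contained sketch that runs a much shorter pattern over an inline HTML fragment. The fragment, the dealer details in it, and the shortened pattern are all made up for illustration. They borrow only the `delrname` and `mobno` class names from the regex above and do not reflect the real page's full markup.

```ruby
# Made-up fragment for illustration; not copied from the real page.
html = '<div class="delrname">Example Motors</div><p>Sector 1, Noida</p>' \
       '<div class="mobno">0000000000</div>' \
       '<div class="delrname">Sample Auto</div><p>Sector 2, Noida</p>' \
       '<div class="mobno">1111111111</div>'

# With capture groups, String#scan returns one array per match,
# holding the captured groups in order.
pattern = /<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div class="mobno">([^<]*)<\/div>/

# Destructure each match into named fields.
dealers = html.scan(pattern).map do |name, address, mobile|
  { name: name, address: address, mobile: mobile }
end

dealers.each { |d| puts "#{d[:name]} | #{d[:address]} | #{d[:mobile]}" }
```

Collecting the groups into hashes keeps the fields named, which tends to be easier to read than indexing into each match by position.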