How do I scrape when there are multiple 'p' tags?
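I'm trying to scrape a website that has multiple <p> tags that always begin with "Located in:...". None of the other <p> tags begin with those words.
How do I get my scraper to pull only those specific tags?
Here's scraper.rb: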
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "http://www.timeout.com/london/restaurants/the-50-best-street-food-stalls-in-london?package_page=68111"
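# Parse the page with Nokogiri (URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs)
page = Nokogiri::HTML(URI.open(url))
# Collect the text of every <h3> heading
name = []
page.css('h3').each do |line|
  name << line.text.strip
end
# Collect the text of every <p> tag
zero = []
page.css('p').each do |line|
  zero << line.text.strip
end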
Here is the incoming HTML to be scraped:
<div class="feature-item__text">
<h3>
Yu Kyu
</h3>
<p class="feature_item__annotation--truncated">
<p>Everybody knows that on any given visit to...</p>
<p><strong>Don't miss:</strong> Curry Katsu Sandwich (£6.50).</p>
<p><strong>Find them at:</strong><a href="http://www.timeout.com/london/restaurants/kerb">Kerb</a>.</p>
<p><strong>But first check:</strong> <a href="...">@_YuKyu_</a></p>
</p>
</div>
</div>
<div class="listing_meta_controls"></div>
</article>
If I understand correctly, you can simply do:
zero = []
page.css('p').each do |line|
  text = line.text.strip
  # include? is already false for an empty string, so no separate blank check is needed
  zero << text if text.include?('Located in')
end
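Since the question asks for paragraphs that begin with the phrase, a stricter check than `include?` may be worth considering. The following is a minimal sketch on plain strings; the `texts` array is hypothetical sample data standing in for the stripped text of each <p> tag, not content taken from the page.

```ruby
# Hypothetical paragraph texts, standing in for
# page.css('p').map { |para| para.text.strip }
texts = [
  'Located in: Camden',
  '',
  'A stall that used to be Located in: Hackney',
  'Find them at: Kerb'
]

# include? matches the phrase anywhere in the string;
# start_with? matches only when the string begins with it.
anywhere = texts.select { |t| t.include?('Located in:') }
leading  = texts.select { |t| t.start_with?('Located in:') }

puts anywhere.size
puts leading.size
```

With this sample data `include?` keeps two strings while `start_with?` keeps only the first, which is closer to what the question describes.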
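There are a couple of problems with your question and how it lines up with the HTML.
The site may be changing its wording to throw off scrapers, and may have changed "Located in:" to "Find them at". If so, you can't rely on that text as a signpost when looking for the information you want.
That said, CSS doesn't let us find text that starts with something, but XPath does: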
@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a')
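That selector finds every <strong> tag whose text starts with "Find", such as <strong>Find them at:</strong>, and returns its following sibling <a> tags, letting you work with the tag's text or 'href' parameter, depending on what you want. Using that selector I see 84 hits on the page, which look like: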
@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first.to_html
#=> "<a href=\"http://www.timeout.com/london/restaurants/kerb\">Kerb</a>"
@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first.text
#=> "Kerb"
@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first['href']
#=> "http://www.timeout.com/london/restaurants/kerb"
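If you want to use CSS, it's possible, but you have to take a different tack. Find the containing <div>, then search inside it:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.timeout.com/london/restaurants/the-50-best-street-food-stalls-in-london?package_page=68111'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(URL))
# Build one hash per stall from its containing <div>
feature_items = doc.search('div.feature-item__text').map { |div|
  h3 = div.at('h3').text.strip
  a = div.at('strong + a') # first <a> that immediately follows a <strong>
  a_text = a.text.strip
  a_href = a['href']
  {
    h3: h3,
    a_text: a_text,
    a_href: a_href
  }
}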
This returns an array of hashes, each of which holds the information for a particular place.
Here are the first five found:
feature_items[0, 5]
# => [{:h3=>"Yu Kyu",
# :a_text=>"Kerb",
# :a_href=>"http://www.timeout.com/london/restaurants/kerb"},
# {:h3=>"Luardos",
# :a_text=>"Kerb",
# :a_href=>"http://www.timeout.com/london/restaurants/kerb"},
# {:h3=>"Mission Mariscos",
# :a_text=>"The Schoolyard",
# :a_href=>"http://www.timeout.com/london/shopping/broadway-market-1"},
# {:h3=>"Butchies",
# :a_text=>"Broadway Market",
# :a_href=>"http://www.timeout.com/london/shopping/broadway-market-1"},
# {:h3=>"BBQ Dreamz",
# :a_text=>"Kerb",
# :a_href=>"http://www.timeout.com/london/restaurants/kerb"}]