Ruby - Scraper concatenate strings
I'm building a Ruby web scraper to collect some information.
In the HTML of the page I want to scrape, each article has 3 spans with the same class:
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
<div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
<span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
<div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
<span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
<div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
<span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
However, some articles do not have the last span ("more details").
So far I have been using this code:
# First loop to find the title
page.css('a.item-link').each do |line|
  puts line.text
end
# Second loop to find the price
page.css('span.item-price').each do |line|
  puts line.text
end
# Third loop to find the details
page.css('span.item-detail').each do |line|
  line.text
end
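I'm using the Nokogiri gem and open-uri to retrieve and parse the file.
How can I concatenate the 3 spans (some articles have only two spans with the "item-detail" class) and print them to the screen?
The output I want is: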
title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3
Some articles have no third span ("more details n"); in that case I want to print " ". My goal is to write the results to a .csv file.
Here is code that works for the sample input, although I had to modify the input XML slightly, wrapping it in a single node (<document>) so that it parses correctly:
require "nokogiri"
html = <<HTML
<document>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
<div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
<span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
<div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
<span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
<article>
<div class="item item_contains_branding" data-adid="1234567">
<div class="clearfix" style="display: block;">
<div class="item-multimedia ">
...
</div>
<div class="item-info-container">
<div class="logo-branding">
...
</div>
<a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
<div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
<span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
<p class="item-description">description...</p>
<div class="item-toolbar clearfix">
...
</div>
</div>
</div>
</div>
</article>
</document>
HTML
page = Nokogiri::XML(html)
articles = page.css('article')

# Titles, one per article
articles.each do |article|
  article.css('a.item-link').each do |link|
    puts "#{link[:title]}"
  end
end

# Prices, one per article
articles.each do |article|
  article.css('span.item-price').each do |price|
    puts "#{price.text}"
  end
end

# First item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts "#{detail_spans[0].text}"
end

# Second item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts "#{detail_spans[1].text}"
end

# Third item-detail span, which may be missing; print " " in that case
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts "#{detail_spans[2] ? detail_spans[2].text.strip : ' '.inspect }"
end
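This code retrieves an array of article elements, then uses each article element in that array to scope the other queries to the elements it contains. This makes it possible to report individual element values at a fine granularity.
The last item-detail query checks whether the element is present to decide what to output when it may be missing. Other queries may need the same technique, depending on the actual content of the HTML document.
These are the results: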
title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3
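Since the stated goal is a .csv file, the same per-article scoping can also be used to build one row per article instead of printing each field in a separate pass. The following is only a minimal sketch under some assumptions: the ARTICLES array is a hard-coded stand-in for the text the per-article queries above return for the sample input, and the column names (type, area, more_details) and the helper names article_row and articles_to_csv are hypothetical, chosen for illustration.

```ruby
require "csv"

# Hypothetical stand-in for the scraped data: one hash per <article>, holding
# the text that the per-article queries return for the sample input.
ARTICLES = [
  { title: "title 1", price: "200€", details: ["T2 ", "20 m²", "  more details 1"] },
  { title: "title 2", price: "300€", details: ["T5 ", "50 m²"] },
  { title: "title 3", price: "500€", details: ["T1 ", "100 m²", "  more details 3"] }
].freeze

# Number of item-detail columns every row should have.
DETAIL_COLUMNS = 3

# Build one flat row per article. Missing trailing details become empty
# strings so that every row has the same number of columns.
def article_row(article)
  details = article[:details].map(&:strip)
  padded = Array.new(DETAIL_COLUMNS) { |i| details[i] || "" }
  [article[:title], article[:price], *padded]
end

# Render a header line plus one line per article as a CSV string.
def articles_to_csv(articles)
  CSV.generate do |csv|
    csv << %w[title price type area more_details]
    articles.each { |article| csv << article_row(article) }
  end
end

puts articles_to_csv(ARTICLES)
```

To write a file instead of printing, the returned string could be passed to File.write with a file name of your choice. Padding to a fixed width keeps the columns aligned for articles that lack the third span.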