Nokogiri CSS method to 2D array
I'm trying to build a simple web scraper, but I've run into a problem.
The site's markup is structured like this:
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
<td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>
Here's what I have so far:
games = page.css("td[class='gametime']").map{|game| game.parent.css("a").text}
This returns an array of strings with three elements (in this case). What I'm trying to get instead is a 2D array, for example:
games[0][0] #=> Sun 01-18-15 09:10 PM
games[0][1] #=> CYCLONES
games[0][2] #=> TIGERS
I don't want this (which is what I currently get):
games[0] #=> Sun 01-18-15 09:10 PMCYCLONESTIGERS
What's the best way to achieve this?
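I don't think text will build an array for you: called on the node set that css returns, it joins the text of every matched node into one string. I think you need a nested map call: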
games = page.css("td[class='gametime']").map{|game| game.parent.css("a").map(&:text)}
You're close:
games = page.css("td.gametime").map { |i| i.parent.css("a").map { |j| j.text } }
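For each td.gametime, go to its parent and get all of the a tags, then map them to their text. That gives you an array of three values for each game, and an array of arrays for the page.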
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
<td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>
EOT
Here's the code:
games = doc.search('tr').map{ |tr| tr.search('td').map(&:text) }
# => [["Sun 01-18-15 09:10 PM", "CYCLONES", "TIGERS"],
# ["Sun 01-25-15 06:40 PM", "LIONS", "CYCLONES"],
# ["Sun 02-01-15 12:50 PM", "CYCLONES", "CLAY"]]
games[0][0] # => "Sun 01-18-15 09:10 PM"
games[0][1] # => "CYCLONES"
games[0][2] # => "TIGERS"
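There's no need to grab the inner tags inside the <td> tags for this HTML. Sometimes that is necessary, when a cell contains additional text that should be ignored, but since this markup is simple the code can be simple too. Calling text on a <td> node returns the text nodes embedded in it.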
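I seriously doubt the HTML they're serving is that simple, and without more detail I can't give a more accurate answer. (It behooves you to provide sufficiently detailed and accurate input.) Still, the general idea is to find the table that contains the rows you want, then drill down: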
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<table class="foo">
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
<td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
<td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
<td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
<td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>
</table>
<table class="bar">
</table>
EOT
The revised code:
games = doc.search('table.foo tr').map{ |tr| tr.search('td').map(&:text) }
# => [["Sun 01-18-15 09:10 PM", "CYCLONES", "TIGERS"],
# ["Sun 01-25-15 06:40 PM", "LIONS", "CYCLONES"],
# ["Sun 02-01-15 12:50 PM", "CYCLONES", "CLAY"]]
games[0][0] # => "Sun 01-18-15 09:10 PM"
games[0][1] # => "CYCLONES"
games[0][2] # => "TIGERS"