Using Ruby / Nokogiri to parse randomized class names
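I've been calculating the remaining votes in the US presidential election by hand, state by state. With so many updates and so many states, it gets tiring. So why not automate the process?
Here's what I'm looking at:
The problem is that the class names are randomized. For example, this is the element I'm interested in: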
<td class="jsx-3768461732 votes votes-row">2,450,186</td>
Trying it in irb, I attempted a wildcard match on "votes votes-row", since it only appears in the document where I need it:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css("[td*='votes-row']")
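...but I get no results (=> []).
What am I doing wrong, and how can I fix it? I'm fine with XPath; I just want to make sure that changes made elsewhere in the document don't affect finding these elements.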
There may be a better way, but...
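require 'nokogiri'
require 'open-uri'
# URI.open is used because Kernel#open no longer opens URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open("https://www.politico.com/2020-election/results/georgia/"))
# Each candidate row becomes an array holding the text of its cells.
votes = doc.css('tr[class*="candidate-row"]').map { |row| row.css('td').map { |cell| cell.content } }
# Locate each candidate's row by the name in the first cell.
biden_row = votes.find_index { |row| row[0] =~ /biden/i }
trump_row = votes.find_index { |row| row[0] =~ /trump/i }
# The second cell's text is the percentage followed by the vote count; keep the part after '%'.
biden_votes = votes[biden_row][1].split('%')[1]
trump_votes = votes[trump_row][1].split('%')[1]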
Edit: the relevant table from the HTML source looks like:
<table class="jsx-1526769828 candidate-table">
<thead class="jsx-3554868417 table-head">
<tr class="jsx-3554868417">
<th class="table-header jsx-3554868417 candidate-name">
<h5 class="jsx-3554868417">Candidate</h5>
</th>
<th class="table-header jsx-3554868417 percent">
<h5 class="jsx-3554868417">Pct.</h5>
</th>
<th class="table-header jsx-3554868417 vote-bar"></th>
</tr>
</thead>
<tbody class="jsx-2085888330 table-head">
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Biden</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label dem">dem</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,450,193</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar dem"></div>
</td>
</tr>
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Trump*</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label gop">gop</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,448,635</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar gop"></div>
</td>
</tr>
</tbody>
</table>
So you can probably use candidate-votes-next-to-percent to get this value. For example:
require 'nokogiri'
require 'open-uri'
# URI.open is used because Kernel#open no longer opens URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open("https://www.politico.com/2020-election/results/georgia/"))
# For each candidate row, pair the short name with the vote count shown next to the percentage.
votes = doc.css('tr[class*="candidate-row"]').map do |row|
  [
    row.css('div[class*="candidate-short-name"]').first.content,
    row.css('div[class*="candidate-votes-next-to-percent"]').first.content
  ]
end
# => [["Biden", "2,450,193"], ["Trump*", "2,448,635"]]
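The counts come back as strings with thousands separators, so they need converting before any arithmetic. A minimal sketch, starting from the result shown above; parse_count, totals and margin are hypothetical names chosen for illustration, and stripping the trailing "*" from the name is an assumption about how you want to key the results:

```ruby
# Convert a string such as "2,450,193" to an Integer.
# Integer() raises on unexpected input instead of silently returning 0.
def parse_count(text)
  Integer(text.strip.delete(','), 10)
end

# The scraped result from the example above, hard-coded here.
results = [["Biden", "2,450,193"], ["Trump*", "2,448,635"]]

# Map candidate name (without the trailing "*") to a numeric vote count.
totals = results.to_h { |name, count| [name.delete('*'), parse_count(count)] }

margin = totals["Biden"] - totals["Trump"]

puts totals.inspect
puts margin
```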