How to read integers from a column of an HTML table in bash
I have an HTML table:
<table>
<tr><td colspan=2>"some text"</td><td>"last week"</td><td>"current week"</td><td>"Delta"</td></tr>
<tr><td>"some text"</td><td>"some text"</td><td>integer</td><td>integer</td><td>integer</td></tr>
<tr><td>"some text"</td><td>"some text"</td><td>integer</td><td>integer</td><td>integer</td></tr>
<tr><td>"some text"</td><td>"some text"</td><td>integer</td><td>integer</td><td>integer</td></tr>
<tr><td>"some text"</td><td>"some text"</td><td>integer</td><td>integer</td><td>integer</td></tr>
</table>
I want to extract every integer from the "current week" column, i.e. the second integer in each row, skipping the first (header) row.
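Perl to the rescue: the HTML::TableExtract module can select a column by its header text.
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TableExtract;

# Keep only the column whose header cell matches 'current'.
my $te = 'HTML::TableExtract'->new( headers => [ 'current' ] );
$te->parse('<table>...</table>');

# Take the first matching table and print its single retained column.
my $tab = ($te->tables)[0];
for my $row ($tab->rows) {
    print $row->[0], "\n";
}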
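Since the question asks for bash, here is a minimal sketch of consuming that one-integer-per-line output on the shell side. The function names `print_current_week` and `sum_integers` and the sample values are made up for illustration; in practice the first function would be replaced by a call to the Perl script saved under a name of your choice.

```shell
#!/bin/sh
# Stand-in for the extractor (hypothetical). In practice this would run
# the Perl script, which prints one cell value per line.
print_current_week() {
    printf '%s\n' 12 7 30 5   # made-up sample values
}

# sum_integers: read one value per line from stdin, skip anything that is
# not a plain non-negative integer, and print "count total".
sum_integers() {
    count=0
    total=0
    while IFS= read -r value; do
        case $value in
            ''|*[!0-9]*) continue ;;   # skip blanks and non-numeric cells
        esac
        count=$((count + 1))
        total=$((total + value))
    done
    echo "$count $total"
}

print_current_week | sum_integers
```

The `case` filter is there because table cells may hold placeholders or stray whitespace. Note that values with leading zeros (such as 08) would be read as octal by shell arithmetic and would need to be normalized first.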
Input HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table> <tr>
<td colspan="2">"some text"</td>
<td>"last week"</td>
<td>"current week"</td>
<td>"Delta"</td>
</tr> <tr>
<td>"some text"</td>
<td>"some text"</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr> <tr>
<td>"some text"</td>
<td>"some text"</td>
<td>integer</td>
<td>integer</td>
<td>integer</td>
</tr> <tr>
<td>"some text"</td>
<td>"some text"</td>
<td>integer</td>
<td>integer</td>
<td>integer</td>
</tr> <tr>
<td>"some text"</td>
<td>"some text"</td>
<td>integer</td>
<td>integer</td>
<td>integer</td>
</tr> </table>
</body></html>
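Using xmllint, anchored on the header cell:
$ xmllint --html --xpath "//td[text()='\"current week\"']/following::td[5]/text()" file_or_URL
Or, more simply, use XPath with numeric positions (counted from 1):
$ xmllint --html --xpath "//tr[2]/td[4]/text()" file_or_URL
Both commands select the "current week" cell of the first data row. Output for the sample input above:
2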
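The commands above each return a single cell. To get the whole column, one option is to loop over the row positions and query the fourth cell of each row. This is only a sketch under some assumptions: the function name `current_week_column` is hypothetical, the document contains a single table whose rows all share one parent, and row 1 is the header. It runs xmllint once per row, which is simple but slow for large tables.

```shell
#!/bin/sh
# current_week_column FILE
# Print the text of the 4th cell of every row except the header row,
# one value per line. Assumes a single, non-nested table in FILE.
current_week_column() {
    file=$1
    # Total number of <tr> elements, header row included.
    rows=$(xmllint --html --xpath 'count(//tr)' "$file" 2>/dev/null) || return 1
    i=2                                   # row 1 is the header
    while [ "$i" -le "$rows" ]; do
        # string(...) gives the cell text, or an empty line if the cell is missing.
        xmllint --html --xpath "string(//tr[$i]/td[4])" "$file" 2>/dev/null
        i=$((i + 1))
    done
}

# Usage (the file name is a placeholder):
#   current_week_column report.html
```

Using `string(...)` rather than `/text()` keeps the output at one line per row, so the result can be fed to a `while read` loop.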