Parsing an HTML table in Perl

Question

I am trying to parse the following HTML table:

<table cellspacing="0" border="1" width="100%">
 <tr bgcolor="#d0d0d0">
  <th style="COLOR: #ff6600">number</th>
  <th style="COLOR: #ff6600">id</th>
  <th style="COLOR: #ff6600">result</th>
  <th style="COLOR: #ff6600">reason</th>
 </tr>
 <tr>
  <td>1027</td>
  <td><a href="<url>">21cs_337</a></td>
  <td>0</td>
  <td>catch-all caught </td>
  <td>reason</td>  
 </tr>
 <tr>
  <td>10288</td>
  <td><a href="<url>">21cs_437</a></td>
  <td>0</td>
  <td>badfetch</td>
  <td>reason</td>
 </tr>
</table>

I am trying to read the values in this HTML file from my Perl script. I am using HTML::TagParser for this, and I can get the text of a whole row:

$table_old = ($html_old->getElementsByTagName("tr"))[1]->innerText();

But I want the value of each column in each row. I tried this:

$table_new = ($html_new->getElementsByTagName("tr"))[1];  
my $temp  = ($table_new->getElementsByTagName("td"))[2]->innerText();

This does not work. Any suggestions on how to parse the column elements effectively?

Answer 1

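You need to use a subtree. As far as I can tell from the module's documentation, getElementsByTagName is a method of the HTML::TagParser document object, not of an individual element, so call subTree on the tr element first and search within that.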

#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TagParser;

my $html = HTML::TagParser->new( 'foo.html' ); # Change this to your file

my $nrow = 0;
for my $tr ( $html->getElementsByTagName("tr" ) ) {
    my $ncol = 0;
    for my $td ( $tr->subTree->getElementsByTagName("td") ) {
        print "Row [$nrow], Col [" . $ncol++ . "], Value [" . $td->innerText() . "]\n";
    }
    $nrow++;
}

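This produces the following output (note that the header row is omitted, because it contains only th cells and no td cells):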

Row [1], Col [0], Value [1027]
Row [1], Col [1], Value [21cs_337]
Row [1], Col [2], Value [0]
Row [1], Col [3], Value [catch-all caught]
Row [1], Col [4], Value [reason]
Row [2], Col [0], Value [10288]
Row [2], Col [1], Value [21cs_437]
Row [2], Col [2], Value [0]
Row [2], Col [3], Value [badfetch]
Row [2], Col [4], Value [reason]
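
If you only need one cell, as in the snippet from the question, the same subTree call can be applied to a single row. Below is a minimal sketch, not a definitive implementation. It assumes that HTML::TagParser->new parses its argument directly when it looks like markup rather than a file name, it uses a trimmed copy of the table with the links dropped, and cell_text is a hypothetical helper name, not part of the module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TagParser;

# Trimmed copy of the question's table (links dropped for brevity).
my $source = <<'HTML';
<table>
 <tr><th>number</th><th>id</th><th>result</th><th>reason</th></tr>
 <tr><td>1027</td><td>21cs_337</td><td>0</td><td>catch-all caught</td></tr>
 <tr><td>10288</td><td>21cs_437</td><td>0</td><td>badfetch</td></tr>
</table>
HTML

# Assumption: new() parses the string itself because it contains markup.
my $html = HTML::TagParser->new($source);

# cell_text is a hypothetical helper, not part of HTML::TagParser.
# It returns the text of the <td> at ($row, $col), or undef if there is none.
sub cell_text {
    my ( $doc, $row, $col ) = @_;
    my @rows = $doc->getElementsByTagName('tr');
    my $tr   = $rows[$row] or return;

    # Search for td cells inside this row only, via its subtree.
    my @cells = $tr->subTree->getElementsByTagName('td');
    my $td    = $cells[$col] or return;
    return $td->innerText();
}

# Row 0 is the header row, so the first data row is row 1.
my $result = cell_text( $html, 1, 2 );
my $id     = cell_text( $html, 2, 1 );
print "result: $result\n";
print "id: $id\n";
```

Returning undef for a missing row or column lets the caller tell "no such cell" apart from an empty cell, which matters here because the header row has no td cells at all.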
