Use HTML::TreeBuilder in Perl to extract all instances of a specific span class
I'm trying to write a Perl script that opens an HTML file and extracts everything contained in the <span class="postertrip"> tags.
Sample HTML:
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply2">
<a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername"><a href="test">Test1</a></span><span class="postertrip"><a href="test">!AAAAAAAA</a></span> 08/01/03(Thu)02:06</label> <span class="reflink"> <a href="test">No.2</a> </span> <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br /> <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>
<blockquote>
<p>Test message 1</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply5">
<a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span> <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> <a href="test">No.5</a> </span>
<blockquote>
<p>Test message 2</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply7">
<a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span> <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> <a href="test">No.7</a> </span>
<blockquote>
<p>Test message 3</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
Desired output:
!AAAAAAAA
!BBBBBBBB
!CCCCCCCC
Current script:
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use HTML::TreeBuilder;
open(my $html, "<", "temp.html")
    or die "Can't open";
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);
foreach my $e ($tree->look_down('class', 'reply')) {
    my $e = $tree->look_down('class', 'postertrip');
    say $e->as_text;
}
Incorrect output from the script:
!AAAAAAAA
!AAAAAAAA
!AAAAAAAA
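Inside your foreach loop you have to look down from the element you just found, not from the root of the tree. Called on $tree in scalar context, look_down returns the first matching element in the whole document every time, which is why the same value is printed three times. The correct code is:
foreach my $parent ($tree->look_down('class', 'reply')) {
    # Search below the current reply cell, not below the root.
    my $e = $parent->look_down('class', 'postertrip');
    say $e->as_text;
}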
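To make the scoping difference concrete, here is a minimal self-contained sketch. It assumes a trimmed-down inline copy of the sample HTML instead of temp.html, and the variable names ($first_text, @trips) are only illustrative. It also skips any reply cell that has no postertrip span, which the loop above does not guard against.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;

# Assumption: a shortened inline stand-in for the sample HTML in temp.html.
my $html = <<'HTML';
<table><tr><td class="reply"><span class="postertrip"><a href="test">!AAAAAAAA</a></span></td></tr></table>
<table><tr><td class="reply"><span class="postertrip">!BBBBBBBB</span></td></tr></table>
<table><tr><td class="reply"><span class="postertrip">!CCCCCCCC</span></td></tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Scalar context on the root: only the first match in the whole tree.
my $first_text = $tree->look_down(class => 'postertrip')->as_text;
say "first match from the root: $first_text";

# Scoped search: start from each reply cell in turn.
my @trips;
for my $reply ($tree->look_down(class => 'reply')) {
    # Skip replies that carry no postertrip span at all.
    my $trip = $reply->look_down(class => 'postertrip') or next;
    push @trips, $trip->as_text;
}
say for @trips;

# Free the tree explicitly, as the HTML::TreeBuilder docs recommend.
$tree->delete;
```

The first say always reports the first span in the document, while the loop yields one value per reply cell.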
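I've never liked HTML::TreeBuilder. It's a bit muddled and hasn't been updated in three years. This is very simple with CSS selectors and Mojo::DOM, though. Its find does all the work that the various look_down calls do: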
use v5.10;
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my @values = Mojo::DOM->new( $html )
    ->find( 'td.reply span.postertrip' )
    ->map( 'all_text' )
    ->each;
say join "\n", @values;
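Notice that your HTML::TreeBuilder code has no logic to select the specific tags you care about: it matches on the class attribute alone, whatever the element. You can do it, but it takes extra work. CSS selectors handle that for you.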
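As a rough idea of what that extra work could look like in HTML::TreeBuilder, the sketch below passes look_down a tag name, a class, and a code ref that checks for an enclosing td with class reply. The inline HTML is a shortened stand-in for the sample, and the stray span outside any reply cell is an added assumption, there only to show the filter doing something.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;

# Assumption: shortened stand-in for the sample HTML. The last span is
# outside any td.reply and is not part of the original sample.
my $html = <<'HTML';
<table><tr><td class="reply"><span class="postertrip"><a href="test">!AAAAAAAA</a></span></td></tr></table>
<table><tr><td class="reply"><span class="postertrip">!BBBBBBBB</span></td></tr></table>
<div><span class="postertrip">!IGNORED</span></div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context returns every match. The criteria require a <span> with
# class "postertrip" that has a <td class="reply"> among its ancestors.
my @spans = $tree->look_down(
    _tag  => 'span',
    class => 'postertrip',
    sub { defined $_[0]->look_up(_tag => 'td', class => 'reply') },
);

my @trips = map { $_->as_text } @spans;
say for @trips;

$tree->delete;
```

One caveat: a string criterion such as class => 'postertrip' is compared against the whole attribute value, so an element with several classes would need a regex criterion (for example class => qr/\bpostertrip\b/), whereas the CSS selector span.postertrip matches it directly.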