How do I iterate over 300 pages with a parser using Perl::Mechanize?
I wrote a small parser that extracts the data from a page.
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

# handlers that post-process the raw values extracted via XPath
my $handler_relurl      = sub { q#https://europa.eu# . $_[0] };    # turn relative hrefs into absolute URLs
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };   # strip leading/trailing whitespace
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };        # drop the "Label:" prefix
my $handler_split       = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };   # split on "; "
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };   # split on ", "

# one parent XPath per record block, plus one child XPath (and optional handlers) per field
my $conf =
{
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
    parent   => q#//div[@class="vp ey_block block-is-flex"]#,
    children =>
    {
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
        title        => [ q#//h4# ],
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
    }
};

print Dumper browse( $conf );

sub browse
{
    my $conf = shift;
    my $ref  = [ ];

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response      = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;

    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );    # one node per record block

    for my $node ( @nodes )
    {
        push @$ref, { };
        while ( my ( $key, $val ) = each %{ $conf->{children} } )
        {
            my $xpath    = $val->[0];
            my $handlers = $val->[1] // [ ];
            # evaluate the child XPath relative to the current record node
            $val = ( $node->findvalues( qq#.$xpath# ) )[0] // next;
            $val = $_->( $val ) for @$handlers;                           # apply the post-processing handlers
            $ref->[-1]->{$key} = $val;
        }
    }
    return $ref;
}
At first glance, the problem of getting from one page to the next can be tackled in different ways:
There is pagination at the bottom of the page, for example:
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5
and
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6
and
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7
We could take these URLs as a base: if we had an array from which to load the URLs that need to be visited, we would hit every page...
Note: we have more than 6000 results, and each page shows 21 small entries, one per record, so we have to visit roughly 305 pages.
We could simply increment the page number (as shown above) and count up to 305.
Hard-coding the total number of pages is impractical because it may vary. We could instead:
- extract the number of results from the first page, divide it by the number of results per page (21), and round down, or
- extract the URL from the "last" link at the bottom of the page, create a URI object from it, and read the page number from its query string (see the sketch below).
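A minimal sketch of the second option; the XPath for the pager's "last" link is only a guess and has to be adapted to the real markup:

use URI;

# hypothetical XPath for the pager's "last" link; adjust the class to the actual pager HTML
my ($last_href) = $html_treebuilder_xpath->findvalues( q#//li[contains(@class,"pager-last")]/a/@href# );
my $last = 0;
if ( defined $last_href )
{
    my %query = URI->new( $last_href )->query_form;   # split the query string into key/value pairs
    $last     = $query{page};                         # 0-based index of the last page
}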
Now I think I have to iterate over all the pages.
my $url_pattern = 'https://europa.eu/youth/volunteering/evs-organisation_en?page=%s';   # '?' starts the query string

for my $page ( 0 .. $last )
{
    my $url = sprintf $url_pattern, $page;
    ...
}
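One way to reuse browse() per page, sketched under the assumption that the base URL accepts the page parameter on its own, is to shallow-copy the config with the per-page URL and merge the results:

my $all = [ ];
for my $page ( 0 .. $last )
{
    # clone the config, overriding only the URL for this page ($url_pattern as defined above)
    my $page_conf = { %$conf, url => sprintf( $url_pattern, $page ) };
    push @$all, @{ browse( $page_conf ) };
}
print Dumper $all;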
Or I could try to fold the pagination into $conf, perhaps with an iterator that fetches the next node on each call...
After parsing each page, check whether the next › link at the bottom exists. Once you reach page 292 there are no more pages, so you are done and can leave the loop, e.g. with last.
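Since the title mentions Mechanize, here is a rough sketch of that next-link loop with WWW::Mechanize; the link text matched by text_regex is an assumption and has to be checked against the real pager:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( $conf->{url} );
while ( 1 )
{
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );
    # ... extract the records from $tree exactly as browse() does ...

    # look for the pager's "next ›" link; the link text is an assumption
    my $next = $mech->find_link( text_regex => qr/next/i );
    last unless $next;              # no "next" link on the last page: we are done, leave the loop
    $mech->get( $next->url );
}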