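How to skip unwanted elements using XML::Twig?
I'm trying to learn XML::Twig and extract some data from an XML document.
My XML contains 20k+ <ADN> elements. Each <ADN> element contains dozens of child elements, one of which is <GID>. I only want to process the ADN elements where GID == 1. (See the sample XML in the __DATA__ section.)
The docs say: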
Handlers are triggered in fixed order, sorted by their type (xpath
expressions first, then regexps, then level), then by whether they
specify a full path (starting at the root element) or not, then by
number of steps in the expression, then number of predicates, then
number of tests in predicates. Handlers where the last step does not
specify a step (foo/bar/*) are triggered after other XPath handlers.
Finally _all_ handlers are triggered last.
Important: once a handler has been triggered if it returns 0 then no
other handler is called, except a _all_ handler which will be called
anyway.
My actual code:
use 5.014;
use warnings;
use XML::Twig;
use Data::Dumper;
my $cat = load_xml_catalog();
say Dumper $cat;
sub load_xml_catalog {
    my $hr;
    my $current;
    my $twig= XML::Twig->new(
        twig_roots => {
            ADN => sub {       # process the <ADN> elements
                $_->purge;     # and purge when finishes with one
            },
        },
        twig_handlers => {
            'ADN/GID' => sub {
                return 1 if $_->trimmed_text == 1;
                return 0;      # skip the other handlers - if the GID != 1
            },
            'ADN/ID' => sub {  #remember the ID as a "key" into the '$hr' for the "current" ADN
                $current = $_->trimmed_text;
                $hr->{$current}{$_->tag} = $_->trimmed_text;
            },
            #rules for the wanted data extracting & storing to $hr->{$current}
            'ADN/Name' => sub {
                $hr->{$current}{$_->tag} = $_->text;
            },
        },
    );
    $twig->parse(\*DATA);
    return $hr;
}
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
Output:
$VAR1 = {
          '1000' => {
                      'ID' => '1000',
                      'Name' => 'other name 1000'
                    },
          '1' => {
                   'Name' => 'name 1',
                   'ID' => '1'
                 },
          '20' => {
                    'Name' => 'should be skipped because GID != 1',
                    'ID' => '20'
                  }
        };
So,
- The handler for ADN/GID returns 0 when GID != 1.
- Why are the other handlers still called?
- The expected (wanted) output does not contain '20' => ...
- How do I skip unwanted nodes correctly?
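The "returns zero" thing is a bit of a red herring in this context. If you had multiple handlers matching the same element, then one of them returning zero would inhibit the others.
That doesn't mean the parser won't go on to process subsequent nodes.
I think you're getting confused: you have handlers for separate child elements of your <ADN> elements, and they trigger separately. That's by design. There is a precedence order for xpath handlers, but it only applies when several handlers match the same element. Yours are completely separate, so they all fire, because they trigger on different elements.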
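The distinction can be checked with a small experiment. This is only a sketch: the extra 'GID' handler and the counter arrays are made up for illustration and are not part of the original code. Two handlers match every <GID> element and both return 0, while a third handler matches <ID>.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened sample document (two <ADN> elements, inlined for brevity).
my $xml = '<ArrayOfADN>'
        . '<ADN><GID>1</GID><ID>1</ID></ADN>'
        . '<ADN><GID>2</GID><ID>20</ID></ADN>'
        . '</ArrayOfADN>';

my ( @gid_calls, @id_calls );

XML::Twig->new(
    twig_handlers => {
        # Both of these match the same <GID> element. Each returns 0,
        # so whichever one is triggered first suppresses the other.
        'ADN/GID' => sub { push @gid_calls, 'ADN/GID'; return 0 },
        'GID'     => sub { push @gid_calls, 'GID';     return 0 },
        # This one matches a different element (<ID>), so the 0 returned
        # for <GID> has no effect on it.
        'ADN/ID'  => sub { push @id_calls, $_->text; return 1 },
    },
)->parse($xml);

printf "GID handler calls: %d\n", scalar @gid_calls;
printf "ID handler calls:  %d (%s)\n", scalar @id_calls, join ',', @id_calls;
```

If the return value works as the documentation describes, only one of the two GID handlers should run per <GID> element, while the ID handler still runs for every <ADN>, including the one with GID 2.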
However, you might find it useful to know that twig_handlers allows xpath expressions, so you can explicitly say:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->parse( \*DATA );
$twig -> set_pretty_print('indented_a');
foreach my $ADN ( $twig -> findnodes('//ADN/GID[string()="1"]/..') ) {
    $ADN -> print;
}
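This applies in the twig_handlers syntax too. I would suggest that using a handler is only really useful if you need to pre-process your XML or you are memory constrained. With 20,000 nodes, you may be (at which point purge is your friend).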
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
    pretty_print  => 'indented_a',
    twig_handlers => {
        '//ADN[string(GID)="1"]' => sub { $_->print }
    }
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
I would probably do it this way, though:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub process_ADN {
    my ( $twig, $ADN ) = @_;
    return unless $ADN -> first_child_text('GID') == 1;
    print "ADN with name:", $ADN -> first_child_text('Name')," Found\n";
}
my $twig = XML::Twig->new(
    pretty_print  => 'indented_a',
    twig_handlers => {
        'ADN' => \&process_ADN
    }
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
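To end up with the hash the question was after, rather than printed output, the same ADN-level handler could fill it in and purge as it goes. This is a rough sketch, not a drop-in replacement: here load_xml_catalog takes the XML as a string argument (an assumption, since the question read from DATA), and only the ID and Name fields from the sample are stored.

```perl
use strict;
use warnings;
use XML::Twig;

# Sketch: returns a hash ref keyed by <ID>, keeping only the <ADN>
# elements whose <GID> is 1. Taking the XML as an argument is an
# assumption made for this example.
sub load_xml_catalog {
    my ($xml) = @_;
    my %catalog;

    my $twig = XML::Twig->new(
        twig_handlers => {
            ADN => sub {
                my ( $t, $adn ) = @_;

                # This handler fires at </ADN>, so every child element
                # has already been parsed and can be inspected here.
                if ( $adn->first_child_text('GID') eq '1' ) {
                    my $id = $adn->first_child_text('ID');
                    $catalog{$id} = {
                        ID   => $id,
                        Name => $adn->first_child_text('Name'),
                    };
                }

                # Release what has been parsed so far; this matters
                # with 20k+ elements.
                $t->purge;
                return 1;
            },
        },
    );
    $twig->parse($xml);
    return \%catalog;
}

my $xml = <<'XML';
<ArrayOfADN>
<ADN><GID>1</GID><ID>1</ID><Name>name 1</Name></ADN>
<ADN><GID>2</GID><ID>20</ID><Name>should be skipped because GID != 1</Name></ADN>
<ADN><GID>1</GID><ID>1000</ID><Name>other name 1000</Name></ADN>
</ArrayOfADN>
XML

my $cat = load_xml_catalog($xml);
print join( ', ', sort { $a <=> $b } keys %$cat ), "\n";
```

Because the decision is made once per <ADN>, after its children are available, there is no need to track a "current" ID across separate child handlers. The entry with ID 20 should never reach the hash.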