How do I remove duplicate nodes from an XML document in Perl?
I have a video sitemap XML file that contains duplicate nodes:
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>http://www.tubtun.com/video/Samsung_42Channel_Wireless_SoundStand</loc>
<video:video>
<video:title>Samsung 42Channel Wireless SoundStand</video:title>
<video:description>Samsung 4.2Channel Wireless SoundStand</video:description>
<video:thumbnail_loc>http://www.tubtun.com/media/files_thumbnail/user91/pl_5364844b0dc.jpg</video:thumbnail_loc>
<video:player_loc>http://www.tubtun.com/modules/vPlayer/vPlayer.swf?f=http://www.tubtun.com/modules/vPlayer/vPlayercfg.php?fid=844b0dc2c7258f4de11</video:player_loc>
<video:publication_date>2015-01-27</video:publication_date>
</video:video>
</url>
<url>
<loc>http://www.tubtun.com/video/Samsung_42Channel_Wireless_SoundStand</loc>
<video:video>
<video:title>Samsung 42Channel Wireless SoundStand</video:title>
<video:description>Samsung 4.2Channel Wireless SoundStand</video:description>
<video:thumbnail_loc>http://www.tubtun.com/media/files_thumbnail/user91/pl_5364844b0dc.jpg</video:thumbnail_loc>
<video:player_loc>http://www.tubtun.com/modules/vPlayer/vPlayer.swf?f=http://www.tubtun.com/modules/vPlayer/vPlayercfg.php?fid=844b0dc2c7258f4de11</video:player_loc>
<video:publication_date>2015-01-27</video:publication_date>
</video:video>
</url>
.....
I have written a Perl script to remove these duplicates:
use strict;
use warnings;
use XML::LibXML;
my $file = 'sitemap.xml';
my $doc = XML::LibXML->load_xml( location => $file );
my %seen;
foreach my $uni ( $doc->findnodes('//url') ) { # 'university' nodes only
my $name = $uni->find('video:title');
print "'$name' duplicated\n",
$uni->unbindNode() if $seen{$name}++; # Remove if seen before
}
$doc->toFile('clarified.xml'); # Print to file
Unfortunately, the resulting file "clarified.xml" is identical to sitemap.xml.
I can't figure out what is wrong with my script.
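I'm not sure why your XML::LibXML version isn't working, although, as noted in the comments, if find isn't matching anything, that would be the root of it.
I'll offer an alternative that uses XML::Twig and does work.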
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $file = 'test3.xml';
my %seen;
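# Handler called once for each complete <url> element.
sub delete_url_if_seen {
    my ( $twig, $url ) = @_;
    # Use the video title as the key for detecting duplicates.
    my $name = $url->get_xpath( './video:video/video:title', 0 )->trimmed_text;
    # Delete this <url> if the same title has been seen before.
    if ( $seen{$name}++ ) { $url->delete; }
}
my $twig = XML::Twig->new(
    'pretty_print'  => 'indented_a',
    'twig_handlers' => { 'url' => \&delete_url_if_seen },
);
# Parse the file and write the result back to the same file.
$twig->parsefile_inplace($file);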
I got it working. Here is the code, based on a solution I tried:
use strict;
use warnings;
use XML::LibXML;
my $file = 'sitemap.xml';
my $doc = XML::LibXML->load_xml( location => $file );
my %seen;
foreach my $uni ( $doc->findnodes("//*[name() ='url']") ) { # 'university' nodes only
my $name = $uni->find('//video:title');
print "'$name' duplicated\n",
$uni->unbindNode() if $seen{$name}++; # Remove if seen before
}
$doc->toFile('clarified.xml'); # Print to file
You should use an XPathContext and register both the video namespace and the default namespace. You should also call findvalue to get the title as a string.
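my $xpc = XML::LibXML::XPathContext->new();
# The default namespace has no prefix in the file, so bind one for use in XPath.
$xpc->registerNs(sitemap => 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpc->registerNs(video => 'http://www.google.com/schemas/sitemap-video/1.1');
for my $node ($xpc->findnodes('//sitemap:url', $doc)) {
    # video:title is a child of video:video, not a direct child of url.
    my $name = $xpc->findvalue('video:video/video:title', $node);
    ...
}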
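For reference, here is a minimal self-contained sketch of the XPathContext approach applied end to end. It assumes the title is a good enough key for spotting duplicates, and it parses a shortened inline sample rather than sitemap.xml; the helper name dedupe_sitemap, the example.com URLs, and the "Another Video" entry are made up for illustration. In the sample, video:title sits inside video:video, so the relative path goes through that element.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: unbinds every <url> whose video title was already seen
# and returns the titles of the removed entries.
sub dedupe_sitemap {
    my ($doc) = @_;

    my $xpc = XML::LibXML::XPathContext->new($doc);
    # The default namespace needs an explicit prefix to be usable in XPath.
    $xpc->registerNs( sitemap => 'http://www.sitemaps.org/schemas/sitemap/0.9' );
    $xpc->registerNs( video   => 'http://www.google.com/schemas/sitemap-video/1.1' );

    my ( %seen, @removed );
    for my $url ( $xpc->findnodes('//sitemap:url') ) {
        # findvalue returns the title as a plain string.
        my $title = $xpc->findvalue( 'video:video/video:title', $url );
        if ( $seen{$title}++ ) {
            $url->unbindNode();    # detach the duplicate from the document
            push @removed, $title;
        }
    }
    return @removed;
}

# Shortened sample data (assumption: the real file has the same structure).
my $xml = <<'XML';
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>http://example.com/video/a</loc>
<video:video>
<video:title>Samsung 42Channel Wireless SoundStand</video:title>
</video:video>
</url>
<url>
<loc>http://example.com/video/a</loc>
<video:video>
<video:title>Samsung 42Channel Wireless SoundStand</video:title>
</video:video>
</url>
<url>
<loc>http://example.com/video/b</loc>
<video:video>
<video:title>Another Video</video:title>
</video:video>
</url>
</urlset>
XML

my $doc     = XML::LibXML->load_xml( string => $xml );
my @removed = dedupe_sitemap($doc);

print "'$_' duplicated\n" for @removed;
print $doc->toString();
```

To run this against the real file, load it with load_xml( location => 'sitemap.xml' ) and write the result with $doc->toFile('clarified.xml'). The whitespace that surrounded a removed url element stays in the document, so the output may contain blank lines where duplicates used to be.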