How do I parse an XML dictionary with XML::LibXML?

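At least, I think it's called a dictionary. I haven't found good resources describing:

  1. This XML format
  2. XML::LibXML. The perldoc doesn't even mention findvalue or textContent, which I found on Stack Overflow.

I have this XML: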

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Clippings</key>
    <array>
        <dict>
            <key>Abbreviation</key>
            <string>;adeb</string>
            <key>Creation Date</key>
            <string>2017-04-22T22:02:32Z</string>
            <key>DateLastUsed</key>
            <string>2021-05-21T13:53:20Z</string>
            <key>Label</key>
            <string></string>
            <key>Modification Date</key>
            <string>2018-04-30T18:16:36Z</string>
            <key>Option</key>
            <string></string>
            <key>Plain Text</key>
            <string>print STDERR "(debug ⌘) $_\n" for @⌘;</string>
            <key>SortOrder</key>
            <string>1</string>
            <key>TotUsed</key>
            <integer>579</integer>
            <key>uuidString</key>
            <string>96707AF9-E9C4-4770-A930-B6889C354243</string>
        </dict>
        <dict>
            <key>Abbreviation</key>
            <string>;ahtml</string>
            <key>DateLastUsed</key>
            <string>2016-03-09T15:50:43Z</string>
            <key>Modification Date</key>
            <string>2016-05-18T09:35:49Z</string>
            <key>Option</key>
            <string></string>
            <key>Plain Text</key>
            <string>print "(debug) Enter to continue:\n";
my $debug = &lt;STDIN&gt;;</string>
            <key>Service</key>
            <string></string>
            <key>SortOrder</key>
            <string>130</string>
            <key>TotUsed</key>
            <integer>1</integer>
            <key>uuidString</key>
            <string>E74E6ADC-7BE3-493B-AD1E-32729CAB2B77</string>
        </dict>
    </array>
    <key>sortOrderForThisSet</key>
    <string>0</string>
    <key>theAbbsFilePath</key>
    <string>/Users/user/path/to/file</string>
    <key>unsavedChanges</key>
    <false/>
    <key>uuidCheckDone</key>
    <string>yes</string>
</dict>
</plist>

So far, I'm trying this Perl:

#!/usr/bin/perl
use strict;
use warnings;

my $filename = '/Users/user/path/to/file';

use XML::LibXML;

# Parse the XML
my $xml = XML::LibXML->load_xml(location => $filename);

for my $entry ($xml->findnodes('//dict')) {
    my $key = $entry->findvalue('@key');
    my $value = $entry->textContent;

    print "$key = $value";
}

$value prints all the values run together, and $key doesn't print anything at all.

I found more information about findvalue and textContent via: perldoc XML::LibXML::XPathContext

for my $entry ($xml->findnodes('//dict')) {
  use Data::Dumper::Simple;
  print STDERR "(debug) " . Dumper($entry);

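This shows that $entry is an XML::LibXML::Element, which also has POD, so you can run perldoc on it.

This is the code I ended up with. It's not a good solution for many reasons, the first being that it still doesn't read the XML properly. But it works at the most basic level and, of course, I'm pressed for time.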

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;
use utf8;
use Encode;

binmode *STDOUT, ':encoding(UTF-8)';    # warning: global effect
binmode *STDERR, ':encoding(UTF-8)';

my $filename = $ENV{HOME} . '/TypeIt4Me/clippings.typeit4me.test-dev-input';

# Parse the XML
my $xml = XML::LibXML->load_xml( location => $filename );

# Iterate the entries
my %input_xml_hr;

for my $entry ( $xml->findnodes('//dict') ) {
    print "\n";
    my %hr;

    my @keys;
    my @vals;
    foreach my $key ( $entry->findnodes('key') ) {
        push @keys, $key->to_literal();
    }
    foreach my $string ( $entry->findnodes('string') ) {
        push @vals, $string->to_literal();
    }

    my $count_keys = scalar @keys;
    my $count_vals = scalar @vals;
    if ( $count_keys ne $count_vals ) {

        # print STDERR "[WARN] keys: $count_keys, vals: $count_vals\n";
        $hr{'WARN'} = "keys: $count_keys, vals: $count_vals";
    }

    foreach my $num ( 0 .. $#keys ) {
        my $key = $keys[$num];
        my $val = $vals[$num];
        $hr{$key} = encode_utf8 $val;
    }

    $input_xml_hr{ $hr{'Abbreviation'} } = $hr{'Plain Text'};
}

for my $key ( keys %input_xml_hr ) {
    print $key . " = " . decode_utf8 $input_xml_hr{$key} . "\n" if defined $input_xml_hr{$key};
}

print "\n";

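First, let's address your absolutely false claim that "Perldoc doesn't even mention findvalue or textContent".

Document nodes are documented in XML::LibXML::Document. The following is practically the first thing that document says: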

It inherits all functions from XML::LibXML::Node as specified in the DOM specification.

Element nodes are documented in XML::LibXML::Element. The following is the first sentence of that document:

The class inherits from XML::LibXML::Node. The documentation for Inherited methods is not listed here.

Both findvalue and textContent are documented in XML::LibXML::Node.
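
As a minimal sketch of how those two methods behave on this format (the inline sample document and the variable names are my own assumptions, not the question's file): findvalue evaluates an XPath expression relative to the node and returns a string, and textContent returns the concatenated text of a node and all its descendants. Since key is a child element rather than an attribute, '@key' matches nothing, and calling textContent on a whole dict runs every value together. Selecting each key and then its first following sibling keeps the pairs apart:

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Hypothetical, trimmed-down sample in the same shape as the question's file.
my $sample = <<'XML';
<plist version="1.0">
<dict>
    <key>Abbreviation</key>
    <string>;adeb</string>
    <key>TotUsed</key>
    <integer>579</integer>
</dict>
</plist>
XML

my $doc = XML::LibXML->load_xml( string => $sample );
my ($dict) = $doc->findnodes('/plist/dict');

# "key" is a child element, not an attribute, so '@key' finds nothing
# and findvalue returns an empty string.
my $attr = $dict->findvalue('@key');

# Pair each <key> with the element that immediately follows it.
my %pairs;
for my $key_node ( $dict->findnodes('key') ) {
    my $name  = $key_node->textContent();
    my $value = $key_node->findvalue('following-sibling::*[1]');
    $pairs{$name} = $value;
}

print "$_ = $pairs{$_}\n" for sort keys %pairs;
```

This only handles flat dicts whose values are simple text; nested array and dict values need a recursive walk.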


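On to the actual question.

Whoever designed this XML schema was told to use XML but clearly didn't understand XML. It's a poor schema even for encoding arbitrary data structures (the way JSON does). The designer evidently meant for the data to be converted into a different format before being used. The following does that: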

use strict;
use warnings;
use feature qw( say state );

use Carp              qw( croak );
use Types::Serialiser qw( );
use XML::LibXML       qw( );

sub qname {
   my ($node) = @_;
   my $ns   = $node->namespaceURI();
   my $name = $node->nodeName();
   return defined($ns) ? "{$ns}$name" : $name;
}

sub deserialize_array {
   my ($array_node) = @_;
   return [ map { deserialize_value($_) } $array_node->findnodes("*") ];
}

sub deserialize_dict {
   my ($dict_node) = @_;

   my $dict = {};
   my @children = $dict_node->findnodes("*");
   while (@children) {
      my $key_node = shift(@children);
      qname($key_node) eq "key"
         or croak("Expected key");

      my $val_node = shift(@children)
         or croak("Expected value");

      my $key = $key_node->textContent();
      my $val = deserialize_value($val_node);
      $dict->{$key} = $val;
   }

   return $dict;
}

sub deserialize_value {
   my ($val_node) = @_;

   state $deserializers = {
      string  => sub { $_[0]->textContent() },
      integer => sub { 0 + $_[0]->textContent() },
      real    => sub { 0 + $_[0]->textContent() },
      true    => sub { $Types::Serialiser::true },
      false   => sub { $Types::Serialiser::false },
      data    => sub { croak("data values not currently supported"); },
      date    => sub { croak("date values not currently supported"); },
      array   => \&deserialize_array,
      dict    => \&deserialize_dict,
   };

   my $val_type = qname($val_node);
   my $deserializer = $deserializers->{$val_type}
      or croak("Unrecognized value type \"$val_type\"");

   return $deserializer->($val_node);
}

sub deserialize_doc {
   my ($doc) = @_;
   my @children = $doc->documentElement->findnodes("*");
   croak("Root element has too few children") if @children == 0;
   croak("Root element has too many children") if @children > 1;
   return deserialize_value($children[0]);
}

{
   my $doc = XML::LibXML->load_xml( location => $ARGV[0] );
   my $prop_list = deserialize_doc($doc);
   ...
}
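
The ... above is where the converted structure gets used. As a hedged sketch of one possibility (the hard-coded $prop_list stands in for the result of deserialize_doc, and abbreviation_map is a hypothetical helper name), the abbreviation-to-text mapping that the question builds could be derived like this:

```perl
use strict;
use warnings;

# Stand-in for the structure returned by deserialize_doc($doc),
# trimmed to the fields used here. The values are made-up samples.
my $prop_list = {
    Clippings => [
        { 'Abbreviation' => ';adeb',  'Plain Text' => 'first snippet' },
        { 'Abbreviation' => ';ahtml', 'Plain Text' => 'second snippet' },
        { 'Abbreviation' => ';empty' },    # no "Plain Text" key
    ],
};

# Hypothetical helper: map each clipping's abbreviation to its plain text,
# skipping clippings that lack either field.
sub abbreviation_map {
    my ($plist) = @_;
    my %map;
    for my $clipping ( @{ $plist->{Clippings} // [] } ) {
        my $abbr = $clipping->{'Abbreviation'};
        my $text = $clipping->{'Plain Text'};
        next if !defined($abbr) || !defined($text);
        $map{$abbr} = $text;
    }
    return \%map;
}

my $map = abbreviation_map($prop_list);
print "$_ = $map->{$_}\n" for sort keys %$map;
```

When printing the real data, which contains non-ASCII characters such as ⌘, an output encoding layer is still needed, for example binmode STDOUT, ':encoding(UTF-8)' as in the question's script.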

If you dump $prop_list, it looks like this:

$prop_list = {
  "Clippings" => [
    {
      "Creation Date" => "2017-04-22T22:02:32Z",
      "TotUsed" => 579,
      "DateLastUsed" => "2021-05-21T13:53:20Z",
      "Abbreviation" => ";adeb",
      "Plain Text" => "print STDERR \"(debug \x{2318}) \$_\\n\" for \@\x{2318};",
      "uuidString" => "96707AF9-E9C4-4770-A930-B6889C354243",
      "Modification Date" => "2018-04-30T18:16:36Z",
      "SortOrder" => 1,
      "Label" => "",
      "Option" => ""
    },
    {
      "Option" => "",
      "Modification Date" => "2016-05-18T09:35:49Z",
      "SortOrder" => 130,
      "uuidString" => "E74E6ADC-7BE3-493B-AD1E-32729CAB2B77",
      "Plain Text" => "print \"(debug) Enter to continue:\\n\";\nmy \$debug = <STDIN>;",
      "Abbreviation" => ";ahtml",
      "Service" => "",
      "TotUsed" => 1,
      "DateLastUsed" => "2016-03-09T15:50:43Z"
    }
  ],
  "unsavedChanges" => $false,
  "sortOrderForThisSet" => 0,
  "uuidCheckDone" => "yes",
  "theAbbsFilePath" => "/Users/user/path/to/file"
};

The above was produced using:

use Data::Dumper;
local $Data::Dumper::Indent = 1;
local $Data::Dumper::Useqq = 1;
print(Data::Dumper->Dump(
   [ $Types::Serialiser::true, $Types::Serialiser::false, $prop_list ],
   [qw( true false prop_list )],
));