How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?

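I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) that I'm extracting from hundreds of HTML documents in existing content in the Movable Type content management system.

I'm writing those phrases out to a file so they can be translated into other languages, as follows: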

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

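When Mojo::DOM encounters an embedded HTML entity (such as &trade;), it converts the entity into the encoded character rather than passing it along as written. I want the entities to be passed through as written.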
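The behaviour described above can be reproduced in isolation. The following is a minimal sketch that assumes a made-up sample string; it is not taken from the Movable Type pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup with a named entity and an escaped ampersand.
my $dom = Mojo::DOM->new('<p>Acme&trade; &amp; Co.</p>');

# text returns decoded characters: &trade; becomes U+2122 and &amp; becomes "&".
my $text = $dom->at('p')->text;

binmode STDOUT, ':encoding(UTF-8)';    # avoid "Wide character" warnings
print "$text\n";
```

Neither entity survives in its written form, which is the problem the rest of this question is about.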

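I recognize that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."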

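If that is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.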
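The undef return value quoted above can itself serve as the test. Here is a minimal sketch, assuming a hypothetical helper named decode_or_keep (not part of Mojolicious) that falls back to the original string when decoding fails:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# decode_or_keep() is a hypothetical helper, not part of Mojolicious.
# It returns the decoded characters, or the input unchanged if decode() fails.
sub decode_or_keep {
    my ($string) = @_;
    my $chars = decode('UTF-8', $string);
    return defined $chars ? $chars : $string;
}

my $utf8_bytes = "caf\xC3\xA9 au lait";    # valid UTF-8 bytes
my $not_utf8   = "caf\xE9 au lait";        # lone \xE9 is not valid UTF-8

my $decoded  = decode_or_keep($utf8_bytes);    # \xC3\xA9 becomes one character
my $fallback = decode_or_keep($not_utf8);      # decode() returned undef, input kept

print length($decoded), " ", length($fallback), "\n";
```

This only detects strings that are not valid UTF-8. A string that was already decoded but happens to look like valid UTF-8 bytes would be decoded a second time, so the sketch is a heuristic rather than a real encoding check.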

How do I most reliably preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

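It looks like XML entities get replaced when you map to text, but when you work with the nodes instead and use their content, the entities are preserved. This minimal example: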

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;

If you want to keep your loop and the map, replace map('text') with map('content'), like this:

for my $phrase ($dom->find('p')->map('content')->each) {
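One caveat worth checking against your own pages: as far as I can tell, content on an element re-serializes its children and escapes only the XML special characters (&, <, >, " and '), so a named entity such as &trade; comes back as the decoded character rather than as written. A minimal sketch with a hypothetical sample string:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup with a named entity and an escaped ampersand.
my $dom = Mojo::DOM->new('<p>Acme&trade; &amp; Co.</p>');

# content re-escapes "&" as &amp; but leaves U+2122 as a plain character.
my @phrases = $dom->find('p')->map('content')->each;

binmode STDOUT, ':encoding(UTF-8)';    # avoid "Wide character" warnings
print "$_\n" for @phrases;
```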

If you have nested tags and want to find only the text (without printing the nested tag names, only their contents), you'll need to walk the DOM tree:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

prints:

this & 
"
that
"
done

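Through testing, my colleague and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, which made it impossible to preserve HTML entities as written. To get around this, we added the following subroutine to double encode ampersands: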

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    #
    ##########   

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
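To illustrate the effect, here is a minimal round trip through the same substitution. The sample string is hypothetical, and encode_amp is repeated only so the snippet runs on its own:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Same substitution as encode_amp() above, repeated for self-containment.
sub encode_amp {
    my ($text) = @_;
    $text .= '';              # turn undef into an empty string
    $text =~ s!&!&amp;!g;     # every "&" becomes "&amp;"
    return $text;
}

my $html    = '<p>Acme&trade; &amp; Co.</p>';    # hypothetical sample markup
my $escaped = encode_amp($html);

# The parser now decodes only the outer &amp; layer, so the original
# entities come back out of text exactly as they were written.
my $text = Mojo::DOM->new($escaped)->at('p')->text;

print "$escaped\n$text\n";
```

Because the substitution touches every ampersand in the input, ampersands inside attribute values are double encoded as well, which may matter if you later read attributes such as href.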

Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.

    $dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on Stack Overflow, see:
# 
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

    print FILE "\n\t### Body\n\n";        

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           

        print_phrase($phrase);

    }
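For readers who want to see what that chain does step by step, here is a minimal standalone sketch. The markup and the shortened selector list 'p, p b' are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; "p b" overlaps with "p" on purpose.
my $dom = Mojo::DOM->new('<p>Intro <b>bold</b> and <a href="#">link</a></p>');

my @phrases = $dom->find('p, p b')
    ->map('descendant_nodes')              # one collection of nodes per match
    ->map('each')                          # flatten into a single collection
    ->grep(sub { $_->type eq 'text' })     # keep text nodes only
    ->map('content')                       # raw text of each node
    ->uniq                                 # drop text reached via both selectors
    ->each;

print "[$_]\n" for @phrases;
```

The text node inside the b element is reached twice, once as a descendant of p and once as a descendant of b, which is why the uniq step is there.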

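The code block above incorporates an earlier suggestion from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which made good observations about how Mojo::DOM works.

This code definitely works for my application.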