How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?
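I am using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) that I'm extracting from hundreds of HTML documents stored as existing content in the Movable Type content management system.
I'm writing those phrases out to a file so they can be translated into other languages, as follows: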
$dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
print_phrase($phrase); # utility function to write out the phrase to a file
}
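When Mojo::DOM encounters embedded HTML entities (for example, &trade; and &nbsp;), it converts those entities into encoded characters instead of passing them along as written. I want the entities passed through as written.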
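I recognize that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is that "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."
If that is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.
How do I most reliably preserve encoded HTML entities while processing HTML documents with Mojo::DOM?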
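On the decoding concern specifically, one option is to attempt a strict UTF-8 decode and fall back to the original string when it fails, rather than detecting the page encoding up front. Below is a minimal sketch using only the core Encode module. The helper name decode_if_utf8 is hypothetical, and it assumes the input is either UTF-8 bytes or already-decoded Perl characters.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return decoded characters when $input holds valid
# UTF-8 bytes; otherwise return $input unchanged.
sub decode_if_utf8 {
    my ($input) = @_;
    return '' unless defined $input;

    # FB_CROAK makes decode() die on malformed input (and on strings that
    # already contain wide characters); LEAVE_SRC keeps $input untouched.
    my $chars = eval { decode('UTF-8', $input, FB_CROAK | LEAVE_SRC) };

    return defined $chars ? $chars : $input;
}
```

Since Mojo::Util::decode returns undef on failure, something like Mojo::Util::decode('UTF-8', $page->text) // $page->text should behave much the same way. One caveat: a character string that happens to look like valid UTF-8 bytes would be decoded a second time, so this is a heuristic, not a guarantee.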
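It looks like when you map to text, the XML entities get replaced, but when you instead work with the nodes and use their content, the entities are preserved. This minimal example: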
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
print $phrase->content(), "\n";
}
prints:
this &amp; &quot;that&quot;
If you want to keep your loop and the map, replace map('text') with map('content'), like so:
for my $phrase ($dom->find('p')->map('content')->each) {
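To make the difference concrete, here is a small side-by-side sketch of the two accessors on the same element. It assumes Mojolicious is installed, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
my $p   = $dom->at('p');

# text() returns the decoded characters of the element's text nodes.
my $text = $p->text;

# content() on a tag re-renders its children as markup, which escapes
# the XML special characters again.
my $content = $p->content;

print "text:    $text\n";
print "content: $content\n";
```

Note that content re-escapes on output instead of remembering the original spelling. As far as I can tell, only the XML special characters (&, <, >, and quotes) are escaped that way, so a named entity such as &trade; would still come out as the decoded character.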
If you have nested tags and want to find only the text (printing the contents of those nested tags but not their names), you need to scan the DOM tree:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');
for my $node (@{$dom->find('p')->to_array}) {
print_content($node);
}
sub print_content {
my ($node) = @_;
if ($node->type eq "text") {
print $node->content(), "\n";
}
if ($node->type eq "tag") {
for my $child ($node->child_nodes->each) {
print_content($child);
}
}
}
prints:
this &
"
that
"
done
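Through testing, my colleague and I were able to determine that Mojo::DOM->new() was automatically decoding ampersand characters (&), which made it impossible to preserve HTML entities as written. To work around this, we added the following subroutine to double-encode ampersands: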
sub encode_amp {
    my ($text) = @_;
    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    ##########
    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
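To see why double encoding works, the round trip can be checked without building a DOM at all. This is only a sketch: it assumes Mojolicious is installed, calls Mojo::Util::html_unescape directly as a stand-in for the decoding the parser applies to text, and repeats the substitution under the hypothetical name double_encode_amp so that it runs on its own.

```perl
use strict;
use warnings;
use Mojo::Util qw(html_unescape);

# Same substitution as encode_amp() above, repeated here so the sketch
# is self-contained.
sub double_encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $source  = 'Acme&trade; &amp; Sons';
my $encoded = double_encode_amp($source);  # every "&" becomes "&amp;"
my $parsed  = html_unescape($encoded);     # one round of entity decoding

print "$encoded\n";
print "$parsed\n";
```

One round of decoding undoes exactly one round of encoding, so the entities in the source text survive as literal text.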
Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.
$dom = Mojo::DOM->new(encode_amp($page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on Stack Overflow, see:
#
#
#
# Original set of selectors in $dom->find() below is as follows:
# 'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {
print_phrase($phrase);
}
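For anyone who finds the chained collection calls hard to follow, the same traversal can be written as explicit loops. This is a sketch rather than the code we actually run: it assumes Mojolicious is installed, the function name extract_phrases and the sample markup are hypothetical, and encode_amp is repeated in simplified form so the example is self-contained.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Simplified copy of encode_amp() from above.
sub encode_amp {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    return $text;
}

# Hypothetical helper: collect the unique text-node contents found under
# every element matching $selectors, in document order.
sub extract_phrases {
    my ($html, $selectors) = @_;
    my $dom = Mojo::DOM->new(encode_amp($html));

    my (@phrases, %seen);
    for my $element ($dom->find($selectors)->each) {
        for my $node ($element->descendant_nodes->each) {
            next unless $node->type eq 'text';   # skip tags, comments, etc.
            my $phrase = $node->content;
            next if $seen{$phrase}++;            # same effect as ->uniq
            push @phrases, $phrase;
        }
    }
    return @phrases;
}

my @phrases = extract_phrases('<p>Acme&trade; <b>Widgets</b></p>', 'p, p b');
print "[$_]\n" for @phrases;
```

Overlapping selectors such as 'p' and 'p b' visit the same text node more than once, which is why the de-duplication step matters.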
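The code block above incorporates @Grinnz's earlier suggestion, as shown in the comments on this question. Thanks also to @Robert for his answer, which made a good observation about how Mojo::DOM works.
This code definitely works for my application.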