Tree Builder issue with Unicode text
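I am using HTML::TreeBuilder to extract the content of a URL with $tree->look_down, and then I take the text portion of the elements that look_down returns. My problem is that when I read that text and write it to a file, it shows up as garbage. I have not been able to make any progress on this.
My sample code: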
use HTML::TreeBuilder;
use HTML::Element;
use utf8;

$url = $ARGV[0];
$page = `wget -qO - "$url" | tee data.txt`;
#print "iam $page\n";

my $tree = HTML::TreeBuilder->new();
$tree->parse_file('data.txt');

my @story = $tree->look_down(
    _tag  => 'div',
    class => 'storydescription'
);
my @title = $tree->look_down(
    _tag => 'title'
);

open(OUT, ">", "story.txt") or die "Cannot open story.txt:$!\n";
binmode(OUT, ":utf8");
foreach my $story (@story) {
    print OUT $story->as_text;
}
close(OUT);
I have tried using binmode on the output filehandle, but it did not help. Text other than Unicode (for example, ASCII characters) is printed to the file correctly.
This is documented in HTML::TreeBuilder:
When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using "html_file" in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.
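As a rough sketch of the first suggestion, the script could open data.txt itself with a decoding layer and hand the filehandle to parse_file. This assumes the downloaded page really is UTF-8, which the question does not state; the sub name extract_story is made up for illustration, while the file names and the look_down criteria are taken from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse $in_path as UTF-8 HTML and write the text of
# every <div class="storydescription"> to $out_path, one per line.
sub extract_story {
    my ($in_path, $out_path) = @_;

    # Assumption: the input is UTF-8. Opening it ourselves with an
    # :encoding layer means the parser receives decoded characters
    # instead of raw bytes treated as Latin-1.
    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Cannot open $in_path: $!\n";

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($in);    # a filehandle, not a filename
    close($in);

    my @story = $tree->look_down(
        _tag  => 'div',
        class => 'storydescription',
    );

    # Encode the characters back to UTF-8 on the way out.
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot open $out_path: $!\n";
    print {$out} $_->as_text, "\n" for @story;
    close($out) or die "Cannot close $out_path: $!\n";

    $tree->delete;             # release the parse tree
    return scalar @story;
}

# Run as: perl script.pl data.txt story.txt
extract_story(@ARGV) if @ARGV == 2;
```

If the encoding is not known in advance, the quoted documentation's other suggestion should fit the same place: load IO::HTML and pass html_file($in_path) to parse_file instead of the hand-opened filehandle. Note also that use utf8 only tells Perl the script's own source is UTF-8; it does not change how files are read or written.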