Tree Builder issue with unicode text

Question

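I am using HTML::TreeBuilder to extract the content of a URL with $tree->look_down, and then pulling the text out of what look_down returns. My problem is that when I read that text and write it to a file, it shows up as garbage. I cannot make any progress on this.

My sample code: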

use HTML::TreeBuilder;
use HTML::Element;

use utf8;

$url = $ARGV[0];
$page = `wget -qO -  "$url"| tee data.txt`;
#print "iam $page\n";
my $tree = HTML::TreeBuilder->new(  );
$tree->parse_file('data.txt');

my @story = $tree->look_down(
    _tag  => 'div',
    class => 'storydescription'
);

my @title = $tree->look_down(
    _tag  => 'title'
);

open(OUT,">","story.txt") or die"Cannot open story.txt:$!\n";
binmode(OUT,":utf8");

foreach my $story(@story) {
    print OUT $story->as_text;
}
close(OUT);

I have tried using binmode on the output filehandle, but it did not help. Text that is not Unicode (for example, ASCII characters) is written to the file correctly.

Answer 1

This is documented in HTML::TreeBuilder:

When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.

One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using "html_file" in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.
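As a rough illustration of the approach quoted above, here is a minimal sketch that hands parse_file a filehandle from IO::HTML's html_file instead of a filename. The file names sample.html and story.txt and the sample text are made up for the example, and it assumes IO::HTML is installed from CPAN.

```perl
use strict;
use warnings;
use utf8;    # this script itself contains a literal non-ASCII string
use HTML::TreeBuilder;
use IO::HTML qw(html_file);

# Hypothetical sample page, written as UTF-8 so the sketch is self-contained.
my $file = 'sample.html';
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!\n";
print {$out} '<html><head><meta charset="utf-8"><title>t</title></head>'
           . '<body><div class="storydescription">日本語 café</div></body></html>';
close($out);

# html_file opens the file and applies the :encoding layer it detects.
# To do it by hand instead: open(my $fh, '<:encoding(UTF-8)', $file)
my $fh = html_file($file);

# Pass the filehandle, not the filename, so the parser sees decoded characters.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($fh);

my $text = join '',
    map { $_->as_text }
    $tree->look_down(_tag => 'div', class => 'storydescription');

# Encode on the way out, as the question already does with binmode.
open(my $story, '>:encoding(UTF-8)', 'story.txt')
    or die "Cannot open story.txt: $!\n";
print {$story} $text;
close($story);

$tree->delete;
```

With the input decoded on the way in and encoded again on the way out, the non-ASCII text should reach story.txt intact. In the question's script, that would mean replacing $tree->parse_file('data.txt') with $tree->parse_file(html_file('data.txt')).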
