Decoding result with Mojo::UserAgent for regex match
I'm trying to figure out why this doesn't work:
my $url = 'www880740.com';
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0" );
my $tx = $ua->get(
$url =>
{ 'Accept-Charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' }
);
my $page_title = $tx->result->dom->at( 'title' )->text;
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{$type}/) {
print "$page_title seems to be $type!\n";
last;
}
}
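Basically, I want to fetch the page title for the URL and check whether it matches any of these scripts. I assume it fails because I need to decode the title into something the regex can match. When I load a curl'ed copy of the page into memory, it works fine. Devel::Peek::Dump gives me: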
SV = PV(0x55cd8264d650) at 0x55cd824c4b10
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55cd82655d80 "1127436644181.com/11274366342371/253172515113/730311274366/253172514724/73031256|0425317217"\0
CUR = 91
LEN = 96
COW_REFCNT = 0
Update: I finally got it working:
my $page_title = $tx->result->dom->at( 'title' )->text;
use Encode;
use Encode::Detect;
use Encode::HanExtra;
$page_title = decode("Detect", $page_title);
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/) {
print "$page_title seems to be $type!\n";
last;
}
}
This line:
$page_title = decode("Detect", $page_title);
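Detect tries to detect the encoding and then converts the string to Perl's internal representation, so that my regex can work on it. I tried to post my sample output, but for some reason it triggered the spam filter?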
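To illustrate why the decode step matters, here is a minimal sketch that uses only the core Encode module. The two-character sample string and the variable names are assumptions for illustration, not the actual title of the site; the bytes are produced locally with encode('euc-cn', ...) to stand in for an undecoded GB2312 title.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample: GB2312 (euc-cn) bytes for the Han characters U+4E2D U+6587,
# standing in for a title that has not been decoded yet.
my $bytes = encode('euc-cn', "\x{4E2D}\x{6587}");

# Undecoded: every byte is seen as a code point below 0x100, none of which is Han.
my $raw_matches = $bytes =~ /\p{Han}/ ? 1 : 0;

# Decoded: the string now holds the real Unicode code points.
my $chars = decode('euc-cn', $bytes);
my $decoded_matches = $chars =~ /\p{Han}/ ? 1 : 0;

printf "raw bytes: length %d, Han match %d\n", length $bytes, $raw_matches;
printf "decoded:   length %d, Han match %d\n", length $chars, $decoded_matches;
```

Before decoding, the string holds four bytes, so \p{Han} has nothing to match; after decoding it holds two Han characters and the property match succeeds.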
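The title is in charset=gb2312 and needs to be decoded into Perl's internal representation.
The following code demonstrates decoding the title of this particular website and printing it to the console.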
use strict;
use warnings;
use feature 'say';
use utf8;
use Mojo::UserAgent;
use Encode qw/encode decode/;
binmode STDOUT, ':encoding(UTF-8)';
my $url = 'www880740.com';
my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0' );
my $res = $ua->get( $url )->result;
my $page_title = decode('euc-cn',$res->dom->at('title')->text);
say 'GOT: ' . $page_title;
exit;  # stops after printing the title; remove this line to run the script check below
my @langs = qw/Arabic Armenian Bengali Bopomofo Braille Buhid
Canadian_Aboriginal Cherokee Cyrillic Devanagari
Ethiopic Georgian Greek Gujarati Gurmukhi Han
Hangul Hanunoo Hebrew Hiragana Inherited Kannada
Katakana Khmer Lao Limbu Malayalam Mongolian
Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog
Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/;
for( @langs ) {
say "$page_title matches $_!" if $page_title =~ /\p{$_}/;
}
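The code above hard-codes euc-cn because this particular site declares gb2312. As a rough sketch of how the charset could instead be picked up from the page itself, the following reads it from a meta tag in the raw bytes. The helper name decode_by_meta_charset, the regex, and the sample HTML are all assumptions for illustration; real pages may declare the charset elsewhere (for example in the Content-Type header) or declare it incorrectly.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: decode raw HTML bytes using the charset named in a
# <meta> tag, falling back to $default when none is declared or recognized.
sub decode_by_meta_charset {
    my ($bytes, $default) = @_;
    my ($name) = $bytes =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i;
    my $enc = find_encoding($name // $default) // find_encoding($default);
    return $enc->decode($bytes);
}

# Assumed sample page: a gb2312 declaration followed by a GB2312-encoded title.
my $html = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
         . '<title>' . encode('euc-cn', "\x{4E2D}\x{6587}") . '</title>';

my $text = decode_by_meta_charset($html, 'UTF-8');
my ($title) = $text =~ m{<title>(.*?)</title>}s;

print "title matches Han\n" if $title =~ /\p{Han}/;
```

Encode treats gb2312 as an alias for euc-cn, which is why the declared name can be passed to find_encoding directly.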