How do I scrape the following using Web::Scraper?
This question is different from, but related to, an earlier question.
I have to scrape a page with Web::Scraper where the HTML can vary slightly. Sometimes it looks like this:
<div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p>
<strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p>
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
I am extracting the data with the following Web::Scraper code:
my $test = scraper {
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process '//p/text()', 'desc' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};
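But sometimes the page contains the following HTML instead (note that each title and description pair is no longer wrapped in its own <p>):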
<div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
<strong>TITLE2</strong>
<br>
DESCRIPTION2
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
How can I scrape the HTML above into:
test => [
{ desc => "DESCRIPTION1 ", name => "TITLE1" },
{ desc => "DESCRIPTION2 ", name => "TITLE2" },
{ desc => "DESCRIPTION3 ", name => "TITLE3" },
]
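I tried modifying the code above, but I can't figure out what to use to 'split' the HTML into unique title and description pairs.
I have never used Web::Scraper, but its behaviour seems either buggy or weird.
The following XPath expressions should more or less work for both cases (with minor tweaks):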
//div//strong/text()
//div//br/following-sibling::text()
Plugging these into xmllint (libxml2) gives:
tmp >xmllint --html --shell a.html
/ > cat /
-------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p>
<strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p>
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
</body></html>
/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content=TITLE1
2 TEXT
content=TITLE2
3 TEXT
content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content= DESCRIPTION1
2 TEXT
content= DESCRIPTION2
3 TEXT
content= DESCRIPTION3
/ > load b.html
/ > cat /
-------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
<strong>TITLE2</strong>
<br>
DESCRIPTION2
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div></body></html>
/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content=TITLE1
2 TEXT
content=TITLE2
3 TEXT
content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 5 nodes:
1 TEXT
content= DESCRIPTION1
2 TEXT
content=
3 TEXT
content= DESCRIPTION2
4 TEXT
content=
5 TEXT
content= DESCRIPTION3
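The five nodes returned for b.html appear because `following-sibling::text()` matches every later text sibling of each <br>, including the whitespace-only nodes between a </strong> and the next <br>. As a sketch only, assuming the XML::LibXML module is installed (it is not part of the original thread's code), one way to pair each title with its description is to start from each <strong> and take its first non-blank following text sibling. The `[normalize-space()][1]` predicates are standard XPath 1.0; the sample HTML is the second layout from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

# The second layout from the question: all pairs inside a single <p>.
my $html = <<'HTML';
<div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
<strong>TITLE2</strong>
<br>
DESCRIPTION2
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
HTML

# recover => 2 asks libxml2 to parse leniently without printing HTML errors.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

my @pairs;
for my $strong ( $doc->findnodes('//div//strong') ) {
    # First following sibling text node that is not whitespace-only.
    my $desc = $strong->findvalue('following-sibling::text()[normalize-space()][1]');
    $desc =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @pairs, { name => $strong->textContent, desc => $desc };
}

printf "%s => %s\n", $_->{name}, $_->{desc} for @pairs;
```

Because the lookup is relative to each <strong>, the same loop should also handle the first layout, where every pair sits in its own <p>.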
When you plug various versions of these into Web::Scraper, they don't work.
process '//div', 'test[]' => scraper {
    process '//strong', 'name' => 'TEXT';
    process '//br/following-sibling::text()', 'desc' => 'TEXT';
};
Result:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
process '//div', 'test[]' => scraper {
    process '//div//strong', 'name' => 'TEXT';
    process '//div//br/following-sibling::text()', 'desc' => 'TEXT';
};
Result:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
Even the most basic case fails:
process 'div', 'test[]' => scraper {
    process 'strong', 'name' => 'TEXT';
};
Result:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ name => "TITLE1" }] }
{ test => [{ name => "TITLE1" }] }
Even if you tell it to use libxml2 via use Web::Scraper::LibXML, nothing changes!
To make sure I wasn't going crazy, I tried Ruby's Nokogiri:
/tmp >for f in a b; do ruby -rnokogiri -rpp -e'pp Nokogiri::HTML(File.read(ARGV[0])).css("div p strong").map &:text' $f.html; done
["TITLE1", "TITLE2", "TITLE3"]
["TITLE1", "TITLE2", "TITLE3"]
What am I missing?
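One likely explanation for the single-record results above is the key syntax rather than the XPath: in Web::Scraper a key written without `[]` keeps only the first matching node, while a key ending in `[]` collects every match, which is closer to what the Nokogiri `map` call does. The sketch below illustrates the difference; the shortened HTML sample and the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Shortened sample in the style of the question's second layout.
my $html = <<'HTML';
<div><p><strong>TITLE1</strong><br>DESCRIPTION1
<strong>TITLE2</strong><br>DESCRIPTION2</p></div>
HTML

# 'name' (no brackets) keeps only the first matching node.
my $first = scraper { process 'div strong', 'name' => 'TEXT'; };

# 'name[]' collects every matching node into an array reference.
my $all = scraper { process 'div strong', 'name[]' => 'TEXT'; };

my $one  = $first->scrape($html);
my $many = $all->scrape($html);

print "scalar key: $one->{name}\n";
print "array key:  @{ $many->{name} }\n";
```

The same applies to the outer `test[]` key: it iterates over the nodes matched by its own selector, so with a single <div> it can only ever produce one record.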
I think I have solved it. I'm not sure this is the best approach, but it seems to handle both cases.
my $test = scraper {
    process '//div', 'test' => scraper {
        process '//div//strong//text()', 'name[]' => 'TEXT';
        process '//p/text()', 'desc[]' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    }
};
my $res = $test->scrape($html);
# get the names and descriptions
my @keys   = @{ $res->{test}->{name} };
my @values = @{ $res->{test}->{desc} };
# merge the two arrays into a hash
my %hash;
@hash{@keys} = @values;
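The hash slice above loses document order and collapses repeated titles, whereas the question asked for an ordered list of name/desc records. A minimal sketch of that last step in plain Perl follows; the two literal arrays stand in for `$res->{test}{name}` and `$res->{test}{desc}`, and `@test` is just an illustrative name.

```perl
use strict;
use warnings;

# Stand-ins for $res->{test}{name} and $res->{test}{desc} from the scraper.
my @keys   = ( 'TITLE1', 'TITLE2', 'TITLE3' );
my @values = ( 'DESCRIPTION1', 'DESCRIPTION2', 'DESCRIPTION3' );

# Pair by index and stop at the shorter list, so a length mismatch
# cannot produce records with an undefined field.
my $count = @keys < @values ? scalar @keys : scalar @values;
my @test  = map { +{ name => $keys[$_], desc => $values[$_] } } 0 .. $count - 1;

print "$_->{name}: $_->{desc}\n" for @test;
```

Pairing by index only works if both lists line up one to one, so it is worth checking that the two arrays have the same length before trusting the result.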