Symfony Dom Crawler Missing Node, Inconsistent Behaviour
Using this code:
<?php
use Symfony\Component\DomCrawler\Crawler;
require_once(__DIR__ . '/../vendor/autoload.php');
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p class="message">Hello World!</p>
<p>Hello Crawler!</p>
<p>OUTSIDE
<span>
Child SPAN
</span>
<div>
Child DIV
</div>
<p>
Child PARAGRAPH
</p>
</p>
</body>
</html>
HTML;
$crawler = new Crawler($html);
$crawlerFiltered = $crawler->filter('body > p');
$results = [];
$childResults = [];
for ($i = 0; $i < count($crawlerFiltered); $i++) {
    $results[] = $crawlerFiltered->eq($i)->html();
    $children = $crawlerFiltered->eq($i)->children();
    if (count($children)) {
        for ($j = 0; $j < count($children); $j++) {
            $childResults[] = $children->eq($j)->html();
        }
    }
}
echo 'Parent Nodes:' . PHP_EOL;
var_export($results);
echo PHP_EOL;
echo 'Child Nodes:' . PHP_EOL;
var_export($childResults);
I get the results:
Parent Nodes:
array (
0 => 'Hello World!',
1 => 'Hello Crawler!',
2 => 'OUTSIDE
<span>
Child SPAN
</span>
',
3 => '
Child PARAGRAPH
',
)
Child Nodes:
array (
0 => '
Child SPAN
',
)
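Which represent the following issues:
- Child results: no DIV or P (only inline tags)
- Parent results: PARAGRAPH has no tag, inconsistent with SPAN
- Parent results: should only contain the first p, since the second p (PARAGRAPH) doesn't have body as its parent, but p
Do you know why this is? And how can the above issues be fixed?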
The documentation for this component states:
Note
The DomCrawler will attempt to automatically fix your HTML to match the official specification. For example, if you nest a <p> tag inside another <p> tag, it will be moved to be a sibling of the parent tag. This is expected and is part of the HTML5 spec.
You might have better luck using the built-in DomDocument classes. Most HTML parsers are designed to deal with "tag soup" and will attempt to correct perceived problems.
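To see what the parser actually did with the markup, it can help to dump the repaired tree instead of the original source. The following is a minimal sketch that uses only PHP's built-in DOM extension. The shortened HTML string and the variable names are illustrative assumptions, not part of the question's code. Judging from the output in the question, the same repair that moves the nested <p> also seems to apply to the <div>: a block-level element cannot stay inside a <p>, so the paragraph is closed before it and the <div> ends up as a child of body. That would explain why children() on the third paragraph returns only the span.

```php
<?php
// Sketch: inspect the tree the parser builds from invalid nesting.
// Only the built-in DOM extension is used; no Symfony packages are required.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<p>OUTSIDE
<span>Child SPAN</span>
<div>Child DIV</div>
<p>Child PARAGRAPH</p>
</p>
</body>
</html>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings about the invalid markup instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$body = $doc->getElementsByTagName('body')->item(0);

// Collect the element children of <body> as they exist after the repair.
$bodyChildren = [];
foreach ($body->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $bodyChildren[] = $node->nodeName;
    }
}
echo 'body children: ', implode(', ', $bodyChildren), PHP_EOL;

// Check where the <div> ended up in the repaired tree.
$div = $doc->getElementsByTagName('div')->item(0);
echo 'div parent: ', $div->parentNode->nodeName, PHP_EOL;
```

If the div and the inner paragraph show up as children of body here, then a selector such as 'body > p' matching the inner paragraph is consistent with the repaired tree rather than a crawler bug. Selecting 'body > *' in the crawler, or fixing the source markup so that block elements are not nested inside <p>, are two possible ways to get at the intended nodes.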