How to get the exact structure of an HTML document using DOMDocument or XPath in PHP?
I have an HTML document, for example:
<!DOCTYPE html>
<html>
<head>
<title>Webpage</title>
</head>
<body>
<div class="content">
<div>
<p>Paragraph</p>
</div>
<div>
<a href="someurl">This is an anchor</a>
</div>
<p>This is a paragraph inside a div</p>
</div>
</body>
</html>
I want to get the exact structure of the div with the class content.
In PHP, if I use DOMDocument's getElementsByTagName() method to fetch the div, I get this:
DOMElement Object
(
[tagName] => div
[schemaTypeInfo] =>
[nodeName] => div
[nodeValue] =>
Paragraph
This is an anchor
This is a paragraph inside a div
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => div
[baseURI] =>
[textContent] =>
Paragraph
This is an anchor
This is a paragraph inside a div
)
How can I get this instead:
<div class="content">
<div>
<p>Paragraph</p>
</div>
<div>
<a href="someurl">This is an anchor</a>
</div>
<p>This is a paragraph inside a div</p>
</div>
Is there any way to do this?
Assuming $str contains the HTML:
// Create the DOMDocument and parse the HTML
$doc = new DOMDocument();
$doc->loadHTML($str);
// Find the div we need
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@class = "content"]');
// Stop unless exactly one div matched (zero or several are both treated as errors)
if ($elements->length != 1) die("expected exactly one div with class 'content'");
// Take the first (and only) match
$div = $elements->item(0);
// Output the markup of $div, including its child elements
echo $doc->saveHTML($div);
Result:
<div class="content">
<div>
<p>Paragraph</p>
</div>
<div>
<a href="someurl">This is an anchor</a>
</div>
<p>This is a paragraph inside a div</p>
</div>
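The query above compares the whole class attribute to "content", so a div written as class="content main" would not match, and the script stops when more than one div matches. As a minimal sketch of one way to relax both limits, the snippet below matches content as a single token of the class attribute and serializes every match. The sample markup and the variable names $query and $fragments are made up for illustration and are not part of the original question.

```php
<?php
// Hypothetical sample input; in the answer above this would be $str.
$str = <<<HTML
<!DOCTYPE html>
<html>
<body>
<div class="content main"><p>First</p></div>
<div class="content"><p>Second</p></div>
<div class="contents"><p>Not matched</p></div>
</body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($str);

$xpath = new DOMXPath($doc);
// Pad the normalized class list with spaces so that "content" only
// matches as a whole token, not as a prefix of "contents".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " content ")]';

$fragments = [];
foreach ($xpath->query($query) as $div) {
    // saveHTML($node) serializes just that node and its descendants.
    $fragments[] = $doc->saveHTML($div);
}

echo implode("\n", $fragments), "\n";
```

With this input the loop should collect the first two divs and skip the one whose class is contents. Whether to loop over all matches or to stop as the original answer does depends on whether several content blocks are valid in your documents.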