Parsing HTML with PHP DOMXPath
I want to use PHP and DOMXPath to retrieve event links and text from an external website. The external site's HTML is structured as follows:
<!-- first -->
<div class="col-sm-12 col-lg-3 me recording-item">
<div class="recording-item-inner">
<a class="col-sm-12 recording-name" href="/recordings/191">
<div class="info">
<b>Daily Event</b><br>
<small>29 Jun 2020</small>
</div></a>
</div>
</div>
<!-- second -->
<div class="col-sm-12 col-lg-3 me recording-item">
<div class="recording-item-inner">
<a class="col-sm-12 recording-name" href="/recordings/190">
<div class="info">
<b>Daily Event B</b><br>
<small>26 Jun 2020</small>
</div></a>
</div>
</div>
<!-- third -->
<div class="col-sm-12 col-lg-3 me recording-item">
<div class="recording-item-inner">
<a class="col-sm-12 recording-name" href="/recordings/189">
<div class="info">
<b>Daily Event C</b><br>
<small>22 Jun 2020</small>
</div></a>
</div>
</div>
I'm trying to retrieve the name, date, and link of the 5 most recent events. At the moment I can get the latest (single) event using the code below.
<?php
function getEvents()
{
    $page = file_get_contents('https://example.com/events');
    $rootUrl = 'https://example.com';
    @$doc = new DOMDocument();
    @$doc->loadHTML($page);
    $xpath = new DomXPath($doc);
    $nodeList = $xpath->query("//div[@class='recording-item']");
    $node = $nodeList->item(0);
    $href = $xpath->evaluate("string(//div[@class='recording-item-inner']/a/@href)");
    $eventUrl = $rootUrl . $href;
    return $eventUrl;
}
?>
How can I modify this code so that it retrieves the details of the 5 most recent events and prints them as a simple list of items:
<ul>
<li>Event 1 - [name], [date], [href]</li>
<li>Event 2 - [name], [date], [href]</li>
<li>Event 3 - [name], [date], [href]</li>
<li>Event 4 - [name], [date], [href]</li>
<li>Event 5 - [name], [date], [href]</li>
</ul>
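It can be done, but because XPath support is limited, it's not the most elegant solution.
Start from $nodeList. Given that your sample HTML has only 3 events, this code outputs the required information for the first two. Obviously, you can adapt it to your actual code: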
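// Select the div.info of every event whose wrapper contains a div.recording-item-inner.
$nodeList = $xpath->query('//div[./div[@class="recording-item-inner"]]//div[@class="info"]');
$i = 1;
// The list tags are escaped so they are displayed literally in the browser.
echo htmlspecialchars("<ul>", ENT_QUOTES);
echo "<br>";
foreach ($nodeList as $result) {
    if ($i > 2) break; // stop after two events; use 5 on the real page
    echo htmlspecialchars("<li>", ENT_QUOTES);
    // Child nodes 0 and 3 are whitespace text nodes, 1 is <b>, 2 is <br>, 4 is <small>.
    echo "Event " . $i . " - " . $result->childNodes->item(1)->textContent . ", ";
    echo $result->childNodes->item(4)->textContent . ", ";
    // The parent of div.info is the <a> element that carries the link.
    echo $result->parentNode->getAttribute('href');
    echo htmlspecialchars("</li>", ENT_QUOTES);
    echo "<br>";
    $i++;
}
echo htmlspecialchars("</ul>", ENT_QUOTES);
Output:
<ul>
<li>Event 1 - Daily Event, 29 Jun 2020, /recordings/191</li>
<li>Event 2 - Daily Event B, 26 Jun 2020, /recordings/190</li>
</ul>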
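The positional child-node indexes above depend on the exact whitespace inside each div.info, so here is a sketch of a less fragile variant that runs XPath expressions relative to each event node. It assumes every event wrapper carries the recording-item class token and contains one link with a b (name) and a small (date) inside div.info, as in the sample markup. The function names extractEvents and renderEventList are made up for this illustration.

```php
<?php
// Hypothetical helper: returns up to $limit events found in an HTML string.
function extractEvents(string $html, string $rootUrl, int $limit = 5): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // @class='recording-item' would never match here, because the attribute
    // holds several classes. This tests for the whole token instead, so it
    // matches "recording-item" but not "recording-item-inner".
    $items = $xpath->query(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' recording-item ')]"
    );

    $events = [];
    foreach ($items as $item) {
        if (count($events) >= $limit) {
            break;
        }
        // Passing $item as the context node makes each "." expression
        // relative to the current event rather than the whole document.
        $events[] = [
            'name' => $xpath->evaluate('string(.//div[@class="info"]/b)', $item),
            'date' => $xpath->evaluate('string(.//div[@class="info"]/small)', $item),
            'href' => $rootUrl . $xpath->evaluate('string(.//a/@href)', $item),
        ];
    }
    return $events;
}

// Hypothetical helper: renders the events as the list asked for in the question.
function renderEventList(array $events): string
{
    $html = "<ul>\n";
    foreach ($events as $index => $event) {
        $html .= sprintf(
            "<li>Event %d - %s, %s, %s</li>\n",
            $index + 1,
            htmlspecialchars($event['name'], ENT_QUOTES),
            htmlspecialchars($event['date'], ENT_QUOTES),
            htmlspecialchars($event['href'], ENT_QUOTES)
        );
    }
    return $html . "</ul>\n";
}
```

With the $page and $rootUrl values from the question, echo renderEventList(extractEvents($page, $rootUrl)); should print the list as real HTML. The scraped text is escaped, but the list tags are not, which differs from the snippet above that displays the tags literally.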