How to grab the title and content of a web page

Question

I have a web page, for example http://example.com/some-page. If I pass this URL to my PHP function, it should grab the title and content of the page. I tried to grab the title like this:

function page_title($url) {
    $page = @file_get_contents($url);
    if (preg_match('~<h1 class="page-title">(.*)<\/h1>~is', $page, $matches)) {
        return $matches[0];
    }
}

echo page_title('http://example.com/some-page');

What is my mistake?

Answer 1

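Your function is actually almost working. I will propose a DOM parser solution (see below), but before that I will point out some weaknesses in the regular expression and the code: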

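The (.*) capture group is greedy, i.e. it captures the longest possible string before a closing </h1>, even across line breaks (because of the s modifier). So if your document has several h1 tags, it will capture everything up to the last one! You can fix this by making the capture lazy: (.*?)
The actual page may have other tags inside the title, such as span. You could refine the regular expression to exclude any tags around the title, but PHP has a function for this purpose: strip_tags.
Make sure the file content is actually retrieved; an error may have prevented the retrieval, or your server may not allow such requests. When you suppress errors with the @ prefix, you might miss them. I would suggest removing the @. You can also check for a return value of false.
Are you sure you want the content of the H1 tag? A page usually has a dedicated title tag.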
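To make the first two points concrete, here is a small sketch that runs the greedy and the lazy pattern against the same string. The sample markup is made up for illustration and is not taken from your page.

```php
<?php
// Hypothetical sample markup with two h1 elements (an assumption for this demo).
$html = '<h1 class="page-title">First <span>title</span></h1>'
      . '<p>Some text</p>'
      . '<h1 class="page-title">Second title</h1>';

// Greedy: (.*) runs on to the last </h1> in the string.
preg_match('~<h1 class="page-title">(.*)</h1>~is', $html, $greedy);

// Lazy: (.*?) stops at the first </h1>.
preg_match('~<h1 class="page-title">(.*?)</h1>~is', $html, $lazy);

echo $greedy[1], "\n";           // spans both headings and the paragraph between them
echo $lazy[1], "\n";             // First <span>title</span>
echo strip_tags($lazy[1]), "\n"; // First title
```

Note that index 1 holds the capture group, while index 0 holds the whole match including the h1 tags themselves.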

The improvements above give you this code:

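function page_title($url) {
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return null; // nothing to search in
    }
    // Lazy (.*?) stops at the first closing </h1>
    if (preg_match('~<h1 class="page-title">(.*?)<\/h1>~is', $page, $matches)) {
        // Group 1 is the heading's inner HTML; strip nested tags such as <span>
        return strip_tags($matches[1]);
    }
    return null; // no matching h1 found
}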

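While this works, sooner or later you will run into a document that has an extra space in the h1 tag, or another attribute before class, or more than one CSS class, etc., any of which makes the match fail. The following regular expression deals with some of those issues: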

~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is

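... but the class attribute must still come before any other attribute, and its value must be enclosed in double quotes. That can be solved too, but the regular expression would become a monster.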
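As a rough check of what that pattern does and does not accept, the sketch below wraps it in a helper and tries a few hand-written headings. The function name match_page_title and the sample strings are assumptions for illustration only. The title ends up in capture group 3, because the first two groups are the optional neighbouring classes.

```php
<?php
// Hypothetical helper around the tolerant pattern; returns the heading's inner HTML or null.
function match_page_title(string $html): ?string {
    $pattern = '~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is';
    return preg_match($pattern, $html, $m) ? $m[3] : null;
}

// Extra whitespace, extra classes and a trailing attribute are tolerated.
var_dump(match_page_title('<h1 class="page-title">Plain</h1>'));
var_dump(match_page_title('<h1  class = "big page-title wide" id="t">Extra classes</h1 >'));

// These two still fail: another attribute before class, and single quotes.
var_dump(match_page_title('<h1 id="t" class="page-title">Attribute first</h1>'));
var_dump(match_page_title("<h1 class='page-title'>Single quotes</h1>"));
```

The first two calls return the heading text and the last two return null, which is exactly the limitation described above.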

The DOM way

Regular expressions are not the ideal way to extract content from HTML. Here is an alternative function based on DOM parsing:

function xpage_title($url) {
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the url's contents into the DOM, ignore warnings
    libxml_use_internal_errors(true);
    $success = $xml->loadHTMLFile($url);
    libxml_use_internal_errors(false);
    if (!$success) {
        echo "Failed to open $url.";
        return;
    }

    // Find first h1 with class 'page-title' and return its text content
    foreach($xml->getElementsByTagName('h1') as $h1) {
        // Does it have the desired class?
        if (in_array('page-title', explode(" ", $h1->getAttribute('class')))) {
            return $h1->textContent;
        }
    }
}

The above could still be improved by using DOMXPath.
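For what it is worth, a DOMXPath variant could look roughly like the sketch below. The function name xpath_page_title is made up, and it takes the HTML as a string instead of a URL so it can be tried without a network request; treat it as one possible shape rather than the definitive version. The concat/normalize-space expression is the usual XPath 1.0 way to test for one class among several.

```php
<?php
// Hypothetical helper: returns the text of the first h1 carrying the class 'page-title'.
function xpath_page_title(string $html): ?string {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them
    libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // Pad the class list with spaces so 'page-title' only matches as a whole word
    $query = '//h1[contains(concat(" ", normalize-space(@class), " "), " page-title ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

echo xpath_page_title('<html><body><h1 class="big page-title">Hello <span>World</span></h1></body></html>');
// prints: Hello World
```

Compared with the loop over getElementsByTagName, the class test moves into the query, so no explode/in_array is needed.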

Edit

You mentioned in the comments that you do not actually want the content of the H1 tag, because it contains more text than you want.

In that case you could read the contents of the title tag and the article tag:

function page_title_and_content($url) {
    // PHP 5.4: $result = (object) ["title" => null, "content" => null];
    $result = new stdClass();
    $result->title = null;
    $result->content = null;

    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return $result; // both properties stay null
    }
    // Lazy match: text of the first <title> element
    if (preg_match('~<title>(.*?)<\/title>~is', $page, $matches)) {
        $result->title = $matches[1];
    }
    // Greedy match: everything from the first <article> to the last </article>
    if (preg_match('~<article>(.*)<\/article>~is', $page, $matches)) {
        $result->content = $matches[1];
    }
    return $result;
}

$result = page_title_and_content('http://www.example.com/example');
echo "title: " . $result->title . "<br>";
echo "content: <br>" . $result->content . "<br>";

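The above code returns an object with two properties: title and content. Note that the content property will contain HTML tags, possibly with images and so on. If you do not need the tags, apply strip_tags.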
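If plain text is what you are after, a small clean-up step along these lines might do. The helper name plain_text and the sample string are assumptions for illustration. Besides strip_tags it also decodes entities and collapses the whitespace that removed tags leave behind.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to readable plain text.
function plain_text(string $html): string {
    $text = strip_tags($html);                                        // drop all tags, including <img>
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // &amp; becomes &
    return trim(preg_replace('/\s+/', ' ', $text));                   // collapse runs of whitespace
}

echo plain_text('<p>Tom &amp; Jerry <img src="a.png" alt=""> <b>rock</b></p>');
// prints: Tom & Jerry rock
```

You would call it as plain_text($result->content) on the object returned above.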
