Quickest & Efficient way of retrieving article final URL and images

Question

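I wrote a PHP script that parses RSS feeds and tries to get the Open Graph image from the og:image meta tag.

To get the image, I need to check whether the URL in the RSS feed is a 301 redirect. This happens often, which means I have to follow every redirect to reach the resulting URL. As a result, the script runs really slowly. Is there a quicker and more efficient way to achieve this?

Here is the function that fetches the final URL: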

function curl_get_contents($url) {
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    return $result;
}

Here is the function that retrieves the og image, if it exists:

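function getog($url) {
    // Start with empty values so the "first occurrence wins" checks below are defined.
    $ogProperty = array('url' => '', 'title' => '', 'image' => '');
    $html = curl_get_contents($url);
    // curl_exec() returns false on failure; an empty body has nothing to parse.
    if ($html === false || $html === '') {return $ogProperty;}
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $query = '//*/meta[starts-with(@property, \'og:\')]';
    $metas = $xpath->query($query);
    foreach ($metas as $meta) {
        $property = $meta->getAttribute('property');
        $content = $meta->getAttribute('content');
        if($property == "og:url"   && $ogProperty['url'] == "")     {$ogProperty['url'] = $content;}
        if($property == "og:title" && $ogProperty['title'] == "")   {$ogProperty['title'] = $content;}
        if($property == "og:image" && $ogProperty['image'] == "")   {$ogProperty['image'] = $content;}
    }
    return $ogProperty;
}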

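There is a lot more to the script, but these functions are the bottleneck. I'm also caching to a text file, which means it's faster after the first run.

How can I speed up my script so that it retrieves the final URL and fetches the image URLs from the links in the RSS feed?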

Answer 1

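I'm afraid you cannot speed up the extraction process itself. One possible improvement is to handle the image extraction string-wise, that is (although this is usually strongly discouraged) to use a regular expression focused on the og: tags.

This has a major drawback: it breaks easily if the source markup changes, and it offers no significant speed advantage over the more stable DOM-parsing approach.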
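For illustration only, a minimal sketch of what such a string-wise extraction might look like. The function name `og_image_from_html` is hypothetical, and the pattern assumes the `property` attribute appears before `content` inside the tag, which is exactly the kind of assumption that makes this approach fragile:

```php
<?php
// Hypothetical helper: pulls the first og:image URL out of raw HTML with a regex.
// Assumes property="og:image" precedes content="..." within the same <meta> tag.
function og_image_from_html(string $html): ?string
{
    $pattern = '/<meta\s+[^>]*property=["\']og:image["\'][^>]*content=["\']([^"\']*)["\']/i';
    if (preg_match($pattern, $html, $matches)) {
        // Attribute values are HTML-escaped, so decode entities such as &amp;.
        return html_entity_decode($matches[1], ENT_QUOTES);
    }
    return null; // no matching tag found
}
```

A page that writes `content` before `property` would return `null` here, whereas the DOM-based version does not depend on attribute order.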

I'm also caching to a text file, which means it's faster after the first run.

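On the other hand, you could take the approach of always serving users from the cache only, and updating it with an asynchronous call whenever a request requires it.

As CBroe noted in a comment: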

There is no way to speed up following redirects. The client has to make a new request, and that takes the time it takes. With CURLOPT_FOLLOWLOCATION cURL does this automatically already, so there is no point where you could possibly interject to make anything faster.

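This means it is not a heavy task for your web server but a lengthy one, because it has to wait for a large number of requests to complete. That is a very good basis for starting to think asynchronously:

You receive a request looking for the RSS items,
You serve the response very quickly from the cache,
You send an asynchronous request to rebuild the cache if needed. Because of the redirects and the DOM parsing, this is the longest part, but the original client/peer request for the list of RSS items does not have to wait for it to finish; for that request, only dispatching the rebuild request takes time, which is negligible by comparison,
You return the cached items.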

Asynchronous shell exec in PHP
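The steps above could be sketched roughly as follows. This is an assumption-laden outline, not a definitive implementation: the function names, the JSON cache format, and the background command are all hypothetical, and the `exec` redirection trick assumes a Unix-like shell.

```php
<?php
// Hypothetical helper: returns the cached items right away and, when the cache
// is missing or older than $maxAgeSeconds, calls $triggerRebuild without
// waiting for the rebuild to finish.
function serve_from_cache(string $cacheFile, int $maxAgeSeconds, callable $triggerRebuild): array
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata
    $exists = is_file($cacheFile);
    $isStale = !$exists || (time() - filemtime($cacheFile)) > $maxAgeSeconds;
    if ($isStale) {
        $triggerRebuild();
    }
    if (!$exists) {
        return []; // nothing cached yet
    }
    // Assumption: the cache is stored as a JSON array of items.
    $items = json_decode((string) file_get_contents($cacheFile), true);
    return is_array($items) ? $items : [];
}

// Assumed trigger: start a separate PHP process in the background.
// Redirecting output and appending "&" lets exec() return immediately.
function trigger_rebuild_in_background(string $script): void
{
    exec('php ' . escapeshellarg($script) . ' > /dev/null 2>&1 &');
}
```

A call might look like `serve_from_cache('cache.json', 300, function () { trigger_rebuild_in_background('rebuild.php'); });`, where both file names are placeholders.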

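If you go this route, in your case you get the following advantages:

Fast content serving with high load speed,
No slowdown in load speed while the cache is being rebuilt.

And the following disadvantages:

The first user whose request triggers a feed update does not receive the latest items immediately*,
Subsequent users after the first one do not receive the latest items immediately* either, until the cache is ready.

*The good news is that you can eliminate almost all of these disadvantages by using a looping, timed AJAX request that checks whether there are any new items in the RSS item cache.

If there are, you can show a message at the top (or bottom) notifying the user that new content has arrived, and append that content when the user clicks the notification.

Compared with simply always serving cached content without looping AJAX calls, this approach reduces the delay between an item appearing in the live RSS feed and it showing up on your site to at most n + m, where n is the AJAX request interval and m is the time needed to rebuild the cache.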
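On the server side, the endpoint that the timed AJAX request hits only needs to compare the cache against what the client has already seen. A minimal sketch, assuming each cached item carries a numeric `timestamp` field (that field and the function name `items_newer_than` are hypothetical):

```php
<?php
// Hypothetical check for the polling endpoint: returns the cached items that
// are newer than the timestamp the client last saw.
function items_newer_than(array $cachedItems, int $lastSeen): array
{
    $newItems = array_filter($cachedItems, function (array $item) use ($lastSeen) {
        return isset($item['timestamp']) && $item['timestamp'] > $lastSeen;
    });
    return array_values($newItems); // reindex so it encodes as a JSON list
}
```

The client would send the newest timestamp it has displayed, and show the notification whenever the returned list is not empty.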

Answer 2

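You can use Facebook's OG API. Facebook uses it to scrape the important information from any URL. It is very fast compared with the usual scraping approach.

You can do it like this: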

og_scrapping.php:

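    // Relies on curl_get_contents() from the question.
    function meta_scrap($url){
        // Encode the target URL so its own query string cannot break the request.
        $link = 'https://graph.facebook.com/?id='.urlencode($url).'&scrape=true&method=post';
        $response = curl_get_contents($link);
        return json_decode($response);
    }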

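Then simply call it anywhere after including og_scrapping.php: print_r(meta_scrap('http://www.example.com')); You will get an object, from which you can pick out whatever you need.

For the title, image, URL and description, you can get them like this: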

$output = meta_scrap('http://www.example.com');
$title = $output->title;
$image = $output->image[0]->url;
$description = $output->description;
$url = $output->url;

The major problem arises when scraping the images; getting the title and description is easy. Read this article to get images in a faster way. Also, this will help you save a few seconds.

Answer 3

Metadata is stored in the "head" element.

In your XPath, you have to take the head element into account:

$query = '//head/meta[starts-with(@property, \'og:\')]';
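As a self-contained illustration of that query, here is a small sketch that applies it to an HTML string. The function name `og_tags_from_head` is hypothetical, and the error suppression is an assumption about how you would want to handle invalid markup:

```php
<?php
// Hypothetical helper: collects og:* properties found under <head> only.
function og_tags_from_head(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query('//head/meta[starts-with(@property, "og:")]') as $meta) {
        $property = $meta->getAttribute('property');
        if (!isset($tags[$property])) { // keep the first occurrence only
            $tags[$property] = $meta->getAttribute('content');
        }
    }
    return $tags;
}
```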

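You waste time retrieving, storing and parsing the entire HTML file when you could stop retrieving once the "head" element has ended. Besides, why fetch a 40k web page when you only want 1k?

You "might" consider stopping the retrieval once you have seen the closing "head" tag. It can speed things up when there is nothing else left to optimize, but it is a naughty hack that does not always work.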
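One way this hack might be sketched with cURL is a write callback that aborts the transfer as soon as the closing head tag shows up. The function names below are hypothetical; the sketch relies on the documented behavior that a `CURLOPT_WRITEFUNCTION` callback must return the number of bytes it handled, and that any other value makes cURL stop the transfer:

```php
<?php
// Hypothetical helper: builds a cURL write callback that appends each chunk to
// $buffer and aborts the transfer once "</head>" has been received.
function make_head_only_writer(string &$buffer): callable
{
    return function ($ch, string $chunk) use (&$buffer): int {
        $buffer .= $chunk;
        // Search the whole buffer so a tag split across two chunks is still found.
        if (stripos($buffer, '</head>') !== false) {
            return -1; // not equal to strlen($chunk), so cURL aborts
        }
        return strlen($chunk);
    };
}

// Hypothetical wrapper: follows redirects and returns roughly the <head> part.
function fetch_head(string $url): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, make_head_only_writer($buffer));
    curl_exec($ch); // returns false when the callback aborts; expected here
    return $buffer;
}
```

The buffer may contain a little more than the head, since the transfer stops at the end of whichever chunk contained the closing tag, and pages without a closing head tag are downloaded in full.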


php

curl