Using Goutte to read from a file / string

Question

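I'm building a web scraper with Goutte.

For development, I've saved an .html document that I want to traverse (so I'm not constantly making requests to the website). This is what I have so far: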

use Goutte\Client;

$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);

As far as I can tell, this should call request() in Symfony\Component\BrowserKit and pass in the raw body data. This is the error message I get:

PHP Fatal error:  Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.

If I just use DomCrawler, creating a crawler from a string is trivial (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just not sure how to do the equivalent with Goutte.

Thanks in advance.

Answer 1

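The tool you've decided to use makes real HTTP connections, so it isn't suitable for what you're trying to do, at least not out of the box.

Option 1: Implement your own BrowserKit client

All Goutte does is extend BrowserKit's Client. It uses Guzzle to make the HTTP requests.

All you need to do to implement your own client is extend Symfony\Component\BrowserKit\Client and provide the doRequest() method: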

use Symfony\Component\BrowserKit\Client;
use Symfony\Component\BrowserKit\Request;
use Symfony\Component\BrowserKit\Response;

class FilesystemClient extends Client
{
    /**
     * @param object $request An origin request instance
     *
     * @return object An origin response instance
     */
    protected function doRequest($request)
    {
        $file = $this->getFilePath($request->getUri());

        if (!file_exists($file)) {
            return new Response('Page not found', 404, []);
        }

        $content = file_get_contents($file);

        return new Response($content, 200, []);
    }

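    private function getFilePath($uri)
    {
        // Convert a URI to the path of your saved response. Note that
        // BrowserKit resolves '/test' to an absolute URI such as
        // 'http://localhost/test' before doRequest() is called.
        // It could be something like this (digits are kept as-is):
        return preg_replace('#[^a-zA-Z0-9_\-\.]#', '_', $uri).'.html';
    }
}

$client = new FilesystemClient();
$crawler = $client->request('GET', '/test');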

The client's request() expects a real URI, so you'll need to implement your own logic to convert it to a filesystem location.

Have a look at Goutte's Client for inspiration.
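As a rough sketch of the URI-to-file mapping mentioned above, something along these lines could work. The function name uriToFilePath() and the 'pages' directory are made up for illustration, and the query string is simply ignored, which may or may not suit your saved pages:

```php
<?php

/**
 * Map a request URI to a saved .html file under $baseDir.
 * Hypothetical helper: only the URI path is used, the query string is ignored.
 */
function uriToFilePath(string $uri, string $baseDir): string
{
    // Extract the path part, e.g. '/blog/post-1' from 'http://localhost/blog/post-1'
    $path = parse_url($uri, PHP_URL_PATH);
    $path = is_string($path) ? trim($path, '/') : '';

    // Treat the site root as 'index'
    if ($path === '') {
        $path = 'index';
    }

    // Replace anything that is not a letter, digit, dot, underscore or hyphen
    $name = preg_replace('#[^A-Za-z0-9_.\-]#', '_', $path);

    return rtrim($baseDir, '/') . '/' . $name . '.html';
}

echo uriToFilePath('http://localhost/test', 'pages'), PHP_EOL;        // pages/test.html
echo uriToFilePath('http://localhost/blog/post-1', 'pages'), PHP_EOL; // pages/blog_post-1.html
echo uriToFilePath('http://localhost/', 'pages'), PHP_EOL;            // pages/index.html
```

Flattening the path into a single file name keeps all saved pages in one directory; if two URIs differ only in characters that get replaced, they will collide, so adjust the mapping if that matters for your site.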

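Option 2: Implement a custom Guzzle handler

Since Goutte uses Guzzle, you can provide your own Guzzle handler that loads responses from files instead of making real HTTP requests. Have a look at the handlers and middleware doc.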
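A minimal sketch of such a handler might look like the following. It assumes a Goutte release that is built on Guzzle 6 and exposes setClient() (for example 3.x), a Composer autoloader in vendor/, and a saved test.html next to the script; the file lookup via basename() is only an illustration, not something Goutte or Guzzle prescribes:

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Promise\FulfilledPromise;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\RequestInterface;

// A Guzzle handler is a callable that takes a PSR-7 request plus options
// and returns a promise that resolves to a PSR-7 response.
$fileHandler = function (RequestInterface $request, array $options) {
    // Illustrative mapping: '/test' is served from __DIR__ . '/test.html'
    $file = __DIR__ . '/' . basename($request->getUri()->getPath()) . '.html';

    if (!is_file($file)) {
        return new FulfilledPromise(new Response(404, [], 'Page not found'));
    }

    return new FulfilledPromise(
        new Response(200, ['Content-Type' => 'text/html; charset=UTF-8'], file_get_contents($file))
    );
};

// Hand Goutte a Guzzle client that never touches the network
$client = new Client();
$client->setClient(new GuzzleClient(['handler' => HandlerStack::create($fileHandler)]));

$crawler = $client->request('GET', 'http://localhost/test');
```

Compared with option 1, this keeps Goutte itself unchanged, so the same scraping code can later run against the live site by dropping the setClient() call.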

If all you're after is caching responses so that you make fewer HTTP requests, Guzzle already provides support for that.

Option 3: Use DomCrawler directly

$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('test.html'));

The only downside is that you lose the convenience methods of the BrowserKit client, such as click() or submit().
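For illustration, here is a small sketch of traversing a document loaded from a string with DomCrawler alone. The inline HTML stands in for the contents of test.html, a Composer autoloader is assumed, and filterXPath() is used so that the optional CssSelector component is not required:

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // assumes symfony/dom-crawler is installed

use Symfony\Component\DomCrawler\Crawler;

// Stand-in for file_get_contents('test.html')
$html = <<<'HTML'
<html><body>
  <h1>Hello</h1>
  <a href="/first">First</a>
  <a href="/second">Second</a>
</body></html>
HTML;

$crawler = new Crawler($html);

// Text of the first <h1>
$title = $crawler->filterXPath('//h1')->text();

// Collect the href attribute of every link
$links = $crawler->filterXPath('//a')->each(function (Crawler $node) {
    return $node->attr('href');
});

echo $title, PHP_EOL;                // Hello
echo implode(', ', $links), PHP_EOL; // /first, /second
```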


php

web-scraping

symfony

goutte