Reading from a file/string with Goutte

Mangs · posted 2022-09-24 20:51:43

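Question: Reading from a file/string with Goutte

I'm building a web scraper with Goutte.

For development, I've saved an .html document that I want to traverse (so that I'm not constantly making requests to the website). Here's what I have so far: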

use Goutte\Client;

$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);

From what I can tell, this should call request() in Symfony\Component\BrowserKit and pass in the raw body data. This is the error message I get:

PHP Fatal error:  Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.

If I use DomCrawler on its own, creating a crawler from a string is straightforward (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just not sure how to do the equivalent with Goutte.

Thanks in advance.

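Answer

The tool you've chosen makes real HTTP connections, so it isn't suited to what you want to do, at least not out of the box.

Option 1: Implement your own BrowserKit client

All Goutte does is extend BrowserKit's client. It uses Guzzle to perform the HTTP requests.

All you need to do to implement your own client is extend Symfony\Component\BrowserKit\Client and provide a doRequest() method: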

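// Note: symfony/browser-kit 4.3 renamed Client to AbstractBrowser;
// extend that class instead on newer versions.
use Symfony\Component\BrowserKit\Client;
use Symfony\Component\BrowserKit\Request;
use Symfony\Component\BrowserKit\Response;

class FilesystemClient extends Client
{
    /**
     * Serves a saved file instead of making an HTTP request.
     *
     * @param Request $request The request prepared by BrowserKit
     *
     * @return Response The response built from the saved file
     */
    protected function doRequest($request)
    {
        $file = $this->getFilePath($request->getUri());

        if (!file_exists($file)) {
            return new Response('Page not found', 404, []);
        }

        $content = file_get_contents($file);

        return new Response($content, 200, []);
    }

    private function getFilePath($uri)
    {
        // Convert a URI to the path of your saved response.
        // It could be something like this (digits are kept so that
        // /page1 and /page2 do not map to the same file):
        return preg_replace('#[^a-zA-Z0-9_\-\.]#', '_', $uri).'.html';
    }
}

$client = new FilesystemClient();
// BrowserKit resolves '/test' to http://localhost/test before calling
// doRequest(), so this reads http___localhost_test.html.
$crawler = $client->request('GET', '/test');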

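The client's request() expects a real URI, so you need to implement your own logic to convert it to a filesystem location.

Take a look at the Goutte Client for inspiration.

Option 2: Implement a custom Guzzle handler

Since Goutte uses Guzzle, you can provide your own Guzzle handler that loads responses from files instead of making real HTTP requests. See the handlers and middleware documentation.

If all you're after is caching responses to reduce the number of HTTP requests, Guzzle already has support for that.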
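
As a rough sketch of that URI-to-file logic, the helper below keeps only the path component of the absolute URI that BrowserKit passes in and maps it to a file in a local directory. The function name uriToFixturePath, the fixtures directory, and the naming scheme are all assumptions made for illustration, not part of Goutte or BrowserKit.

```php
<?php
// Hypothetical helper: maps a request URI to a saved HTML file.
// The directory name and the naming scheme are illustrative assumptions.
function uriToFixturePath(string $uri, string $baseDir = 'fixtures'): string
{
    // BrowserKit hands over absolute URIs, so keep only the path part.
    $path = parse_url($uri, PHP_URL_PATH);
    $path = trim(is_string($path) ? $path : '', '/');

    // Treat the site root as "index".
    if ($path === '') {
        $path = 'index';
    }

    // Replace everything outside a conservative whitelist so that a URI
    // can never point outside the base directory.
    $name = preg_replace('#[^a-zA-Z0-9_\-]#', '_', $path);

    return $baseDir . '/' . $name . '.html';
}

echo uriToFixturePath('http://localhost/test'), PHP_EOL;
echo uriToFixturePath('http://localhost/blog/post-1'), PHP_EOL;
```

Dropping the scheme and host keeps file names short, at the cost of not distinguishing between sites; if you scrape several hosts, you would probably want to include the host in the path as well.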
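
One way to try this without writing a handler from scratch is Guzzle's built-in MockHandler, which returns queued responses instead of touching the network. The sketch below is not a file-loading handler as such: it assumes a Guzzle-based Goutte release (before 4.0, where the client still exposes setClient()), Guzzle 6 or later, and a Composer autoloader. The inline HTML stands in for the saved test.html.

```php
<?php
// Assumes a Guzzle-based Goutte release (before 4.0) and Guzzle 6 or later,
// both installed through Composer.
require 'vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Stand-in for file_get_contents('test.html').
$html = '<html><body><h1>Saved page</h1></body></html>';

// MockHandler hands out the queued responses in order, one per request.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], $html),
]);

$guzzle = new GuzzleClient(['handler' => HandlerStack::create($mock)]);

$client = new Client();
$client->setClient($guzzle);

// The URL is a placeholder; no connection is opened.
$crawler = $client->request('GET', 'http://localhost/test');

$title = $crawler->filterXPath('//h1')->text();
echo $title, PHP_EOL;
```

Because the queue holds one response per request, a second request() call would fail once the queue is empty. For more than a handful of pages, a custom handler that picks a file based on the request URI is likely a better fit.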

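Option 3: Use DomCrawler directly

use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler(file_get_contents('test.html'));

The only downside is that you lose some of the BrowserKit client's convenience methods, such as click() or submit(). (selectLink() is a Crawler method, so it remains available.)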
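
To show that traversal still works without a client, here is a small sketch that collects link targets from an HTML string. It assumes symfony/dom-crawler is installed through Composer; the inline markup is a made-up stand-in for the saved test.html.

```php
<?php
// Assumes symfony/dom-crawler is installed through Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Stand-in for file_get_contents('test.html').
$html = '<html><body><a href="/a">First</a><a href="/b">Second</a></body></html>';

$crawler = new Crawler($html);

// Collect every link target. XPath is used so that the example does not
// depend on symfony/css-selector.
$hrefs = $crawler->filterXPath('//a')->each(function (Crawler $node) {
    return $node->attr('href');
});

print_r($hrefs);
```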
