Twisted HTTP client to download a whole page and measure the download time

Question

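I am trying to use the Twisted Agent to implement an HTTP client that downloads the complete web page at a given URL and then measures how long that page took to load. Unfortunately, the code I came up with does not follow the URLs embedded in the HTML tags. As a result, a page that takes 10 seconds to load completely in a browser, because it pulls in content from other sites, loads in under a second with my code, which shows that my code is not correct. The results are the same whether I use BrowserLikeRedirectAgent or RedirectAgent. Any comments are welcome.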

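import time

from twisted import version
from twisted.internet import reactor
from twisted.web.client import Agent, BrowserLikeRedirectAgent, readBody
from twisted.web.http_headers import Headers
from twisted.web.iweb import UNKNOWN_LENGTH

def init_http(url):
    userAgent = 'Twisted/%s (httpclient.py)' % (version.short(),)
    agent = BrowserLikeRedirectAgent(Agent(reactor))

    def response_time_calculator(body, t1):
        # Fires once the body of this single resource has been read.
        end_time = time.time()
        response_time = end_time - t1
        print("Got the Whole page in:  ", response_time)

    start_time = time.time()

    # On Python 3, Agent.request expects the method and the URI as bytes.
    d = agent.request(
        b'GET', url.encode('ascii'), Headers({'user-agent': [userAgent]}))
    def cbResponse(response):
        if response.length is not UNKNOWN_LENGTH:
            print('The response body will consist of', response.length, 'bytes.')
        else:
            print('The response body length is unknown.')
        d = readBody(response)
        d.addCallback(response_time_calculator, start_time)
        return d
    d.addCallback(cbResponse)
    return d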

Answer 1

time.clock measures wall-clock time only on Windows (oddly). Use time.time to measure wall-clock time on all platforms.
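To illustrate why the choice of clock matters here, the following is a minimal sketch and not part of the original answer. The helper name `measure` is hypothetical, and `time.sleep` merely stands in for time spent waiting on the network. Note that time.clock was removed in Python 3.8, so on current interpreters time.time (or another wall-clock timer) is the only option anyway.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.time()
    cpu_start = time.process_time()
    func()
    return time.time() - wall_start, time.process_time() - cpu_start


# Sleeping stands in for waiting on the network: wall-clock time passes
# while the process uses almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print("wall: %.3fs, cpu: %.3fs" % (wall, cpu))
```

A download spends most of its time waiting, so a CPU-time clock would report a number far smaller than what a user actually experiences.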

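Also, you have to implement the link-following part yourself. Agent.request downloads exactly the resource you ask for. If that resource is some HTML with links to other resources, you have to parse the data, extract the links, and follow them.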
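As a rough sketch of the "parse and extract" step, here is one way to collect the sub-resource URLs using only the standard library's html.parser (a swapped-in stand-in, chosen so the snippet has no third-party dependency). The names `ResourceCollector` and `extract_resource_urls` are hypothetical, and the set of tags considered (img, script, stylesheet link) is an assumption about what counts as part of "the whole page"; a real browser fetches more than this.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs of sub-resources a browser would fetch for a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # Only stylesheets; other <link> kinds (e.g. canonical) are skipped.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.urls.append(attrs["href"])


def extract_resource_urls(html_text):
    """Return the raw (possibly relative) URLs found in html_text."""
    collector = ResourceCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


sample = (
    '<html><head><link rel="stylesheet" href="s.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="a.png"><a href="other.html">next</a></body></html>'
)
print(extract_resource_urls(sample))
```

Each URL returned would then need its own request, and the page only counts as loaded once all of those requests have finished.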

You might want to look at scrapy. If not, you could add a somewhat smaller (less featureful) dependency such as html5lib. Something like:

    d = readBody(response)
    d.addCallback(load_images)
    d.addCallback(response_time_calculator, start_time)

...

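from twisted.internet.defer import gatherResults
import html5lib

def load_images(html_bytes):
    image_requests = []
    # Ask for plain tag names; by default html5lib namespaces them.
    doc = html5lib.parse(html_bytes, namespaceHTMLElements=False)
    for img in doc.iter("img"):
        src = img.get("src")
        if not src:
            continue
        # `agent` must be in scope (e.g. define load_images inside
        # init_http), and src must already be an absolute URL.
        d = agent.request(b"GET", src.encode("ascii"))
        d.addCallback(readBody)
        image_requests.append(d)
    return gatherResults(image_requests)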

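I have left out proper URL resolution (that is, handling relative links in the img src) and have not actually tested this. It probably has plenty of bugs, but hopefully it makes the idea clear.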
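For the URL resolution step that was left out, a minimal sketch using the standard library's urllib.parse.urljoin might look like the following. The function name `resolve_resource_urls` is hypothetical, and restricting the result to http and https is an assumption about which URLs are worth fetching.

```python
from urllib.parse import urljoin, urlsplit


def resolve_resource_urls(page_url, srcs):
    """Resolve raw src/href values against page_url into absolute URLs."""
    resolved = []
    for src in srcs:
        absolute = urljoin(page_url, src.strip())
        # Skip data:, javascript:, mailto: and other non-fetchable schemes.
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


for url in resolve_resource_urls(
    "http://example.com/a/page.html",
    ["img/x.png", "/logo.png", "//cdn.example.org/y.png", "data:image/png;base64,AAAA"],
):
    print(url)
```

The resulting strings would still need to be encoded to bytes before being passed to Agent.request.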

http

twisted