How to create a pool of threads
I am trying to scrape all the products on this site:
https://www.jny.com/collections/jackets
It collects the links to all the products and then scrapes them one by one. I am trying to speed this up with multithreading. Here is the code:
def yield1(self, url):
    print("inside function")
    yield scrapy.Request(url, callback=self.parse_product)

def parse(self, response):
    print("in herre")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(self.product_url)
    for pu in self.product_url:
        print("inside the loop")
        with ThreadPoolExecutor(max_workers=10) as executor:
            print("inside thread")
            executor.map(self.yield1, response.urljoin(pu))
This is supposed to create a pool of 10 threads, each of which executes yield1() on the list of URLs. The problem is that the yield1() method never gets called.
yield1 is a generator function. To make it produce a value, you have to call next on it. Change it so that it returns a value instead:
def yield1(self, url):
    print("inside function")
    return scrapy.Request(url, callback=self.parse_product)
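To illustrate the generator point (a minimal sketch, independent of Scrapy, using a hypothetical make_request helper): calling a generator function only builds a generator object; its body does not execute until something iterates it.

def make_request(url):
    # Nothing below runs when make_request() is merely called.
    print("inside function")
    yield url.upper()

g = make_request("https://example.com")   # no output yet; g is just a generator object
print(next(g))                            # now the body runs: prints "inside function", then yields a value

In the original code, each call made through executor.map only produced such a generator object; since nothing iterated it, the print and the scrapy.Request inside were never reached.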
Caveat: I really don't know anything about Scrapy.
The Overview in the docs says that requests are made asynchronously, and your code doesn't look like the examples given there. The example in the overview makes the subsequent requests inside the parse method using response.follow. Your code looks like you are trying to extract links from a page and then asynchronously scrape those links and parse them with a different method. Since it seems like Scrapy will do this for you and handle the asynchronicity itself, I think you just need to define another parse method in your spider and use response.follow to schedule the additional requests. You shouldn't need concurrent.futures; the new requests should all be processed asynchronously.
I can't test this, but I think your spider should look more like this:
import scrapy

class TempSpider(scrapy.Spider):
    name = 'foo'
    start_urls = [
        'https://www.jny.com/collections/jackets',
    ]

    def parse(self, response):
        self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
        for pu in self.product_url:
            print("inside the loop")
            yield response.follow(response.urljoin(pu), self.parse_product)

    def parse_product(self, response):
        '''parses the product urls'''
This assumes that self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall() actually does what it is supposed to do.
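If you want to sanity-check that assumption, one way (assuming Scrapy is installed) is to try the selector interactively in scrapy shell:

scrapy shell 'https://www.jny.com/collections/jackets'
>>> response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()

If that returns an empty list, the XPath needs adjusting before anything else will work.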
Maybe even a separate spider to parse the subsequent links. Or use a CrawlSpider.
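If you go the CrawlSpider route, a rough, untested sketch might look like the following (the spider name JacketsSpider and the restrict_xpaths value are just guesses based on the selector above):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JacketsSpider(CrawlSpider):
    name = 'jackets_crawl'
    start_urls = ['https://www.jny.com/collections/jackets']

    # Only follow links found inside the product grid and hand each
    # response to parse_product; CrawlSpider schedules the requests itself.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class = "collection-grid js-filter-grid"]'),
            callback='parse_product',
        ),
    )

    def parse_product(self, response):
        '''parse a single product page here'''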
Related SO Q&As:
Scraping links with Scrapy
scraping web page containing anchor tag using scrapy (looks familiar)
many more