Call scrapy again with a new URL

This is my spider. It works, but how can I send another spider to the newly discovered URLs? Right now I store every link that starts with http or https, and if a link starts with /, I prepend the base URL.

Then I iterate over that array and launch a new spider on each new URL (at the end of the code).

I can't crawl the new URLs (I know this because the print() for them never shows up in the console).

import scrapy
import re

class GeneralSpider( scrapy.Spider ):
    name = "project"
    start_urls = ['https://www.url1.com/',
                'http://url2.com']

    def parse( self, response ):
        lead = {}
        lead['url'] = response.request.url
        lead['data'] = {}
        lead['data']['mail'] = []
        lead['data']['number'] = []

        selectors = ['//a','//p','//label','//span','//i','//b','//div',
                    '//h1','//h2','//h3','//h4','//h5','//h6','//tbody/tr/td']
        atags = []

        for selector in selectors:
            for item in response.xpath( selector ):
                name = item.xpath( 'text()' ).extract_first()
                href = item.xpath( '@href' ).extract_first()

                if selector == '//a' and href is not None and href !='' and href !='#':
                    if href.startswith("http") or href.startswith("https"):
                        atags.append( href )

                    elif href.startswith("/"):
                        atags.append( response.request.url + href )

                if href is not None and href !='' and href !='#':
                    splitted = href.split(':')

                    if splitted[0] not in lead['data']['mail'] and splitted[0] == 'mailto':
                        lead['data']['mail'].append(splitted[1])

                    elif splitted[0] not in lead['data']['number'] and splitted[0] == 'tel':
                        lead['data']['number'].append(splitted[1])

                else:
                    if name is not None and name != '':
                        mail_regex = re.compile( r'^(([^<>()[\]\.,;:\s@\"]+(\.[^<>()[\]\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$' )
                        number_regex = re.compile( r'^(?:\(\+?\d{2,3}\)|\+?\d{2,3})\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1\d{2}(?:\1\d{2})?)(?:\1\d{2})?$' )

                        if name not in lead['data']['mail'] and re.match( mail_regex, name ):
                            lead['data']['mail'].append(name)

                        elif name not in lead['data']['number'] and re.match( number_regex, name ):
                            lead['data']['number'].append(name)

        print( lead )
        # I want to call the parse method again here, but with the new URL
        for tag in atags:
            scrapy.Request( tag, callback=self.parse )

You need to return the Request objects from the function. Since you are generating several of them, you can use yield, like this:

yield scrapy.Request(tag, callback=self.parse)

"In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects." See scrapy documentation