Call Scrapy again with a new URL
This is my spider and it works, but how can I send another spider to the newly discovered URLs? Right now I store every link that starts with http or https, or, if it starts with /, I prepend the base URL. Then I iterate over that array and launch a new crawl on each new URL (this is at the end of the code).
The new URLs are never scraped (I know this because the print() output never shows up in the console).
import scrapy
import re

class GeneralSpider(scrapy.Spider):
    name = "project"
    start_urls = ['https://www.url1.com/',
                  'http://url2.com']

    def parse(self, response):
        lead = {}
        lead['url'] = response.request.url
        lead['data'] = {}
        lead['data']['mail'] = []
        lead['data']['number'] = []
        selectors = ['//a', '//p', '//label', '//span', '//i', '//b', '//div',
                     '//h1', '//h2', '//h3', '//h4', '//h5', '//h6', '//tbody/tr/td']
        atags = []
        for selector in selectors:
            for item in response.xpath(selector):
                name = item.xpath('text()').extract_first()
                href = item.xpath('@href').extract_first()
                if selector == '//a' and href is not None and href != '' and href != '#':
                    if href.startswith("http") or href.startswith("https"):
                        atags.append(href)
                    elif href.startswith("/"):
                        atags.append(response.request.url + href)
                if href is not None and href != '' and href != '#':
                    splitted = href.split(':')
                    if splitted[0] == 'mailto' and splitted[1] not in lead['data']['mail']:
                        lead['data']['mail'].append(splitted[1])
                    elif splitted[0] == 'tel' and splitted[1] not in lead['data']['number']:
                        lead['data']['number'].append(splitted[1])
                else:
                    if name is not None and name != '':
                        mail_regex = re.compile(r'^(([^<>()[\]\.,;:\s@\"]+(\.[^<>()[\]\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$')
                        number_regex = re.compile(r'^(?:\(\+?\d{2,3}\)|\+?\d{2,3})\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1?\d{2}(?:\1?\d{2})?)(?:\1?\d{2})?$')
                        if name not in lead['data']['mail'] and re.match(mail_regex, name):
                            lead['data']['mail'].append(name)
                        elif name not in lead['data']['number'] and re.match(number_regex, name):
                            lead['data']['number'].append(name)
        print(lead)
        # I want to call the parse method again here, but with the new URL
        for tag in atags:
            scrapy.Request(tag, callback=self.parse)
You need to return the Request objects from the function. Since you are generating several of them, you can use yield, like this:

yield scrapy.Request(tag, callback=self.parse)

"In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects." See the Scrapy documentation.