Issue with Scrapy spider

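I am trying to get the volume-weighted average price (VWAP) of stocks from moneycontrol.com. The parse function runs without any problems, but the parse_links function is never called. Am I missing something here?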

# -*- coding: utf-8 -*-
import scrapy

class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    allowed_domains = ["https://www.moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]

    def parse(self, response):
        for link in response.css('td.last > a::attr(href)').extract():
            if link:
                yield scrapy.Request(link, callback=self.parse_links, method='GET')

    def parse_links(self, response):
        VWAP = response.xpath('//*[@id="n_vwap_val"]/text()').extract_first()
        print(VWAP)
        with open('quotes.txt', 'a+') as f:
            f.write('VWAP: {}'.format(VWAP) + '\n')

If you read the log output, the error becomes obvious.

2018-09-08 19:52:38 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.moneycontrol.com in allowed_domains.
  warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)

2018-09-08 19:52:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-08 19:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moneycontrol.com/india/stockpricequote> (referer: None)
2018-09-08 19:52:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.moneycontrol.com': <GET http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45>

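allowed_domains expects bare domain names, not URLs. Your URL entry is not accepted as a domain, so the offsite middleware filters every request that parse yields and parse_links is never reached. Just fix your allowed_domains and you should be fine: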

allowed_domains = ["moneycontrol.com"]
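To see why the scheme matters, here is a rough sketch of the kind of host comparison an offsite filter performs. It is not Scrapy's actual implementation; the helper name is_allowed is hypothetical and uses only the standard library. The link is the one from the filtered request in the log above.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


link = "http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45"

# A full URL can never match a bare host name, so the request counts as offsite.
print(is_allowed(link, ["https://www.moneycontrol.com"]))

# A bare domain matches the host itself and any of its subdomains.
print(is_allowed(link, ["moneycontrol.com"]))
```

The first call prints False and the second prints True, which mirrors what the log shows: the request is dropped with the URL entry and would pass with the plain domain.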