Issue running Scrapy spider from script. Error: DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor


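Here is the code for the spider. I'm trying to scrape these links with a Scrapy spider and write the output to a CSV file. I tested the CSS selector separately with Beautiful Soup and it returned the links I need, but I can't get this spider to run. I also tried to account for the DEBUG message in the settings, but no luck so far. Please help.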


    in[1]: 
import scrapy
from scrapy.crawler import CrawlerProcess

class espn_spider(scrapy.Spider):
    name = "fsu2021_spider"
    def start_requests(self):
        urls = ["https://www.espn.com/college-football/team/_/id/52"]
        for url in urls:
            yield scrapy.Request(url = url, callback = self.parse)
    def parse(self, response):
        links = response.css('div.global-nav-container li > a::attr(href)')
        link = links.extract()
process = CrawlerProcess(settings = {
    "REACTOR": "twisted.internet.selectreactor.SelectReactor", 
    "FEED_URI": "fsu21.csv", 
    "FEED_FORMAT": "csv"})
process.crawl(espn_spider)
process.start()

out[1]:
2021-12-24 13:25:54 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-12-24 13:25:54 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 21.7.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-12-24 13:25:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-24 13:25:54 [scrapy.crawler] INFO: Overridden settings:
{}
2021-12-24 13:25:54 [scrapy.extensions.telnet] INFO: Telnet Password: c886149440d51d5d
2021-12-24 13:25:54 [py.warnings] WARNING: C:\Users\gtham\Anaconda3\lib\site-packages\scrapy\extensions\feedexport.py:247: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)

2021-12-24 13:25:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2021-12-24 13:25:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-24 13:25:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-24 13:25:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-24 13:25:54 [scrapy.core.engine] INFO: Spider opened
2021-12-24 13:25:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-24 13:25:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-24 13:25:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.espn.com/college-football/team/_/id/52> (referer: None)
2021-12-24 13:25:55 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-24 13:25:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 245,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 78252,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.68894,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 24, 18, 25, 55, 566234),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 12, 24, 18, 25, 54, 877294)}
2021-12-24 13:25:55 [scrapy.core.engine] INFO: Spider closed (finished)

Just a guess: you may be dealing with a dynamically loaded page, which Scrapy can't scrape directly without help from something like Selenium.

I added headers and set up a few loggers, but I got nothing back from start_requests. That's why I made the assumption above.
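One way to test that guess is to check whether the links are already present in the raw HTML, before any JavaScript runs. The following is only a minimal sketch using the standard library: `NavLinkParser`, `static_nav_links`, and the sample markup are hypothetical names and data made up for illustration, and the parser only approximates the `div.global-nav-container li > a` selector from the spider.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NavLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that are direct children of an <li>
    somewhere inside <div class="global-nav-container"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, class list) for currently open elements
        self.links = []

    def _inside_nav(self):
        return any(tag == "div" and "global-nav-container" in classes
                   for tag, classes in self.open_tags)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.open_tags[-1][0] if self.open_tags else None
        if tag == "a" and parent == "li" and self._inside_nav():
            if attrs.get("href"):
                self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, (attrs.get("class") or "").split()))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed children.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


def static_nav_links(html):
    """Return the nav links found in HTML that has not been rendered by a browser."""
    parser = NavLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup; in practice, pass in the body of the downloaded page.
sample = (
    '<div class="global-nav-container"><ul>'
    '<li><a href="/college-football/team/_/id/52">Home</a></li>'
    '<li><span><a href="/nested">Nested</a></span></li>'
    '</ul></div>'
    '<li><a href="/outside">Outside</a></li>'
)
print(static_nav_links(sample))
```

If this returns an empty list for the real page body, the links are probably injected by JavaScript. If it returns links, the static HTML already contains them and the cause lies elsewhere.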

Also note that I tried this again with Splash, and it worked.

Here's the code for it:

import scrapy
from scrapy_splash import SplashRequest


class espn_spider(scrapy.Spider):
    name = "fsu2021_spider"
    def start_requests(self):
        urls = ["https://www.espn.com/college-football/team/_/id/52"]
        for url in urls:
            yield SplashRequest(url = url, callback = self.parse)
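    def parse(self, response):
        # Select the href of every link in the team navigation bar.
        links = response.css('div.global-nav-container li > a::attr(href)')
        # Yield one item per link so the feed exporter has rows to write.
        for href in links.getall():
            yield {'stuff': href}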

Output:

2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': '/college-football/team/_/id/52'}
2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': '/college-football/team/schedule/_/id/52'}
2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': '/college-football/team/stats/_/id/52'}
2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': '/college-football/team/roster/_/id/52'}
2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': 'https://fantasy.espn.com/games/college-football-bowl-mania-2021/make-picks?addata=bowlmania2021_ncaaf_web_teamsubnav'}
2021-12-26 13:43:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/college-football/team/_/id/52>
{'stuff': 'https://www.vividseats.com/ncaaf/florida-state-seminoles-tickets.html?wsUser=717&wsVar=us~ncaaf~clubhouse,desktop,en'}

For a quick overview of setting up Splash, I found this article very helpful.
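For reference, here is a rough sketch of the settings the spider above needs when it is run from a script. The middleware entries follow the scrapy-splash README as I remember it, so check them against your installed version. `build_settings` is a hypothetical helper, and the Splash URL assumes an instance listening locally on port 8050. The `FEEDS` entry replaces the deprecated `FEED_URI` and `FEED_FORMAT` settings that the warning in the question's log complains about.

```python
def build_settings(splash_url="http://localhost:8050", output_file="fsu21.csv"):
    """Return a settings dict for scrapy-splash plus CSV export (hypothetical helper)."""
    return {
        # Assumption: a Splash instance is reachable at this address.
        "SPLASH_URL": splash_url,
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
        # FEEDS supersedes FEED_URI / FEED_FORMAT.
        "FEEDS": {output_file: {"format": "csv"}},
    }


settings = build_settings()
print(sorted(settings))
```

The resulting dict can be passed as `CrawlerProcess(settings=build_settings())` before calling `process.crawl(espn_spider)` and `process.start()`.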