Scrapy - TypeError: 'Rule' object is not iterable

I am trying to scrape the titles from this site (https://minerals.usgs.gov/science/mineral-deposit-database/#products). I am using a CrawlSpider because I plan to extract more information from each URL on the page later.

However, I get TypeError: 'Rule' object is not iterable. Here is the code I am using:

import scrapy
import datetime
import socket
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from usgs.items import MineralItem
from scrapy.loader import ItemLoader


class MineralSpider(CrawlSpider):
    name = 'mineral'
    allowed_domains = ['web']
    start_urls = 'https://minerals.usgs.gov/science/mineral-deposit-database/#products'

    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//*[@id="products"][1]/p/a'),
            callback='parse')
    )

    def parse(self, response):
        it = ItemLoader(item=MineralItem(), response=response)
        it.add_xpath('name', '//*[@class="container"]/header/h1/text()')
        it.add_value('url', response.url)
        it.add_value('project', self.settings.get('BOT_NAME'))
        it.add_value('spider', self.name)
        it.add_value('server', socket.gethostname())
        it.add_value('date', datetime.datetime.now())
        return it.load_item()

Log output:

(base) C:\Users\User\Documents\Python WebCrawling Learing Projects\usgs\usgs\spiders>scrapy crawl mineral
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: usgs)
2018-11-16 17:43:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-11-16 17:43:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'usgs', 'NEWSPIDER_MODULE': 'usgs.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['usgs.spiders']}
2018-11-16 17:43:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2018-11-16 17:43:03 [twisted] CRITICAL: Unhandled error in Deferred:

2018-11-16 17:43:03 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\User\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line 79, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\crawler.py", line 102, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 100, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 51, in from_crawler
    spider = cls(*args, **kwargs)
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 40, in __init__
    self._compile_rules()
  File "C:\Users\User\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 92, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
TypeError: 'Rule' object is not iterable

Any ideas?

Add a comma after your Rule object so that it is treated as a valid tuple:

rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//*[@id="products"][1]/p/a'),
            callback='parse'),
)

You may also want to take a look at this answer: Why does adding a trailing comma after a variable name make it a tuple?
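
For context, the error is a Python tuple-syntax issue rather than anything Scrapy-specific: parentheses without a trailing comma are just grouping, so rules ends up being a single Rule object, and CrawlSpider fails on self._rules = [copy.copy(r) for r in self.rules] in the traceback because a Rule is not iterable. A minimal sketch in plain Python (the variable names are made up for illustration, using an int in place of a Rule):

just_a_value = (42)        # parentheses alone: this is the int 42, not a tuple
one_item_tuple = (42,)     # the trailing comma makes it a one-element tuple

print(type(just_a_value))    # <class 'int'>
print(type(one_item_tuple))  # <class 'tuple'>

try:
    for r in just_a_value:   # same failure mode as the spider:
        pass                 # the bare object cannot be iterated
except TypeError as exc:
    print(exc)               # "'int' object is not iterable"

for r in one_item_tuple:     # the one-element tuple iterates fine
    print(r)

Defining rules as a list, e.g. [Rule(...)], also works and sidesteps the one-element-tuple pitfall, since _compile_rules() only needs something it can iterate over.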