Trouble with scrapy crawl spider

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Jptimes3Spider(CrawlSpider):
    name = 'jptimes3'
    allowed_domains = ['japantimes.co.jp']
    start_urls = ['https://www.japantimes.co.jp/']

    custom_settings = {
        'DOWNLOAD_DELAY': 3,
    }

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="page"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'category': response.css('h3 > span.category-column::text').getall(),
            'category2': response.css('h3.category-column::text').getall(),
            'article title': response.css('p.article-title::text').getall(),
            'summary': response.xpath('//*[@id="wrapper"]/section[2]/div[1]/section[4]/div/ul/li[4]/a/article/header/hgroup/p/text()').getall(),
        }

I am new to scrapy and this is my first crawl spider. I have two problems. First, the spider follows the links but does not scrape any items; I just get the column headers in my CSV. Second, I would like to know whether there is a way to get the same kind of data, for example the categories, into a single column when the values have different CSS/XPath selectors.

The XPath selections in the rule and in the parse_item method are incorrect: the link extractor follows pages whose markup does not match the selectors used in parse_item, so those selectors return nothing and only the column headers end up in the CSV. Here is an example of a working solution.

Script:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Jptimes3Spider(CrawlSpider):
    name = 'jptimes3'
    allowed_domains = ['japantimes.co.jp']
    start_urls = ['https://www.japantimes.co.jp']

    custom_settings = {
        'DOWNLOAD_DELAY': 3,
    }

    # Follow only the anchors inside the front page's "Top News" block.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@data-tb-region="Top News"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            # Category text from the article page's category heading.
            'category': response.xpath('//h3[@class="single-post-categories"]/a/text()').get(),
            # An <h1> can contain several text nodes, so join them into one string.
            'article title': ''.join(response.xpath('//h1/text()').getall()),
        }
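
As for the second question: yes, you can collect the same kind of data into one column even when it lives under different selectors. One way is to combine the selectors themselves (CSS accepts a comma-separated selector group, XPath accepts the | union operator); another is to run the queries separately and concatenate the lists before yielding. A minimal sketch, reusing the two category selectors from your own code (whether those class names match the live markup is an assumption carried over from the question):

    def parse_item(self, response):
        # A comma-separated CSS selector group matches elements that satisfy
        # either selector, so both category variants land in one list and
        # therefore in one CSV column.
        categories = response.css(
            'h3 > span.category-column::text, h3.category-column::text'
        ).getall()

        # Equivalent approach: query separately and merge the results.
        # categories = (response.css('h3 > span.category-column::text').getall()
        #               + response.css('h3.category-column::text').getall())

        yield {'category': categories}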