Scrapy: How to visit several subpages and extract all the text?

Question

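I want to perform three simple tasks that should work on most pages:

1. Get all links on the main page https://www.stadt-koeln.de/politik-und-verwaltung/stadtentwicklung/
2. Visit the extracted subpages (e.g. https://www.stadt-koeln.de/politik-und-verwaltung/stadtentwicklung/heliosgelaende)
3. Simply get all the text found on each subpage

My approach so far: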

import scrapy

class StadtKoelnSpider(scrapy.Spider):
    name = "stadt_koeln"

    def start_requests(self):
        urls = ['http://www.stadt-koeln.de/politik-und-verwaltung/stadtentwicklung/']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # 1. Get all links 
        for url in response.xpath('body//a/@href').getall():

            # 2. Visit each subpage 
            yield scrapy.Request(url.get(), callback=self.parse_subpage)

    def parse_subpage(self, response):
        # 3. Get text on each subpage 
        text = response.xpath("//p/text()").extract()
       
        yield {
            'Subpage_Text': text
        }

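No output is generated. Any idea how to make this work? There are several ways to follow links in Scrapy, but I have not found an example that fits my case.

I also get the error below. I guess it means that only part of the URL was extracted rather than the full URL.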

AttributeError: 'str' object has no attribute 'get'

Answer 1

You can use Scrapy's CrawlSpider. See the example below. Note that this returns all the text from every element on the page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StadtKoelnSpider(CrawlSpider):
    name = 'stadt_koeln'
    allowed_domains = ['www.stadt-koeln.de']
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/stadtentwicklung/']

    rules = (
        Rule(LinkExtractor(allow=r"politik-und-verwaltung\/stadtentwicklung"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        all_text = response.xpath("//*/text()").getall()
        yield {
            "Subpage_Text": " ".join(all_text)
        }

python

xpath

scrapy