How to scrape all the content of each link with scrapy?
I am new to scrapy and I want to extract all the content of each advertisement from this website. So I tried the following:
from scrapy.spiders import Spider
from craigslist_sample.items import CraigslistSampleItem
from scrapy.selector import Selector

class MySpider(Spider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        links = response.selector.xpath(".//*[@id='sortable-results']//ul//li//p")
        for link in links:
            content = link.xpath(".//*[@id='titletextonly']").extract()
            title = link.xpath("a/@href").extract()
            print(title, content)
Items:
# Define here the models for your scraped items
from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    link = Field()
However, when I run the crawler I get nothing:
$ scrapy crawl --nolog craig
[]
[]
[]
[]
[]
... (many more empty lists)
So my question is: how can I iterate over each URL, follow each link, and scrape the content and the title? What is the best approach?
Let me try to answer your question.
First, you got blank results because of your incorrect XPath queries. With the XPath ".//*[@id='sortable-results']//ul//li//p" you do locate the relevant <p> nodes, although I am not fond of that expression. However, I cannot make sense of your following XPath expressions ".//*[@id='titletextonly']" and "a/@href"; they cannot locate the link and the title as you expected. You probably meant to locate the title text and the title's hyperlink. If so, I believe you need to learn XPath; start with the HTML DOM.
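As a minimal sketch of what you probably meant, you could keep your outer query and only fix the inner expressions; the inner paths below match the working spider further down in this answer, so treat the exact markup as an assumption:

# Sketch: fixed inner expressions for the original loop (markup may have changed)
for link in response.xpath(".//*[@id='sortable-results']//ul//li//p"):
    title = link.xpath("a/text()").extract_first()   # the title text
    url = link.xpath("a/@href").extract_first()      # the title's hyperlink
    print(url, title)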
I won't go deep into how to write XPath queries here, since there are plenty of resources online. Instead, I'd like to mention a few features of the Scrapy XPath selector:
- The Scrapy XPath Selector is an improved wrapper around the standard XPath query.
A standard XPath query returns an array of the DOM nodes you query. You can open your browser's developer tools (F12) and test expressions with the console command $x(x_exp). I strongly recommend testing your XPath expressions this way: it gives you immediate feedback and saves a lot of time. If you have the time, get familiar with your browser's web development tools; they let you quickly understand the structure of a page and find the elements you are looking for.
Scrapy's response.xpath(x_exp), on the other hand, returns an array of Selector objects corresponding to the actual XPath query, which is in fact a SelectorList object. This means the XPath result is represented by a SelectorList. Both the Selector and SelectorList classes provide some useful functions for working with the results:
- extract: returns a list of serialized document nodes (as unicode strings)
- extract_first: returns a scalar, the first of the extract results
- re: returns a list, the result of applying the regular expression to the extract results
- re_first: returns a scalar, the first of the re results
These functions make your code much more convenient. For example, you can call the xpath function directly on a SelectorList object. If you have tried lxml before, you will find this very useful: to call xpath on the result of a previous xpath in lxml, you have to iterate over the previous results yourself. Another example: when you are sure there is at most one element in the list, you can use extract_first to get a scalar value instead of indexing the list (e.g. rlist[0]), which raises an index error when nothing matches. Remember that there are always exceptions when parsing a web page, so be careful and write robust code. (A small sketch after this list illustrates these calls.)
- Absolute XPath vs. relative XPath
Keep in mind that if you are nesting XPathSelectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the XPathSelector you’re calling it from.
When you call node.xpath(x_expr), if x_expr starts with /, it is an absolute query and XPath searches from the document root; if x_expr starts with ., it is a relative query. This is also covered in the standard, 2.5 Abbreviated Syntax (the sketch after this list shows the difference):
. selects the context node
.//para selects the para element descendants of the context node
.. selects the parent of the context node
../@lang selects the lang attribute of the parent of the context node
- How to follow the next page, and when to stop following.
For your application, you will probably need to follow the next pages. Here the next-page node is easy to locate: there is a next-page button. However, you also need to take care of when to stop following. Look closely at the URL query parameters to understand the URL pattern of your application. Here, to decide when to stop following the next page, you can compare the current range of items with the total number of items.
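To make the selector points above concrete, here is a minimal sketch you could paste into scrapy shell http://sfbay.craigslist.org/search/npo; the element ids and paths are the ones used in this answer, so treat the exact markup as an assumption:

# Sketch for `scrapy shell`; ids/paths may differ on the live page.
rows = response.xpath("//*[@id='sortable-results']//li")        # a SelectorList

# You can chain .xpath() directly on SelectorList / Selector objects.
titles = rows.xpath("./p/a/text()").extract()                   # list of strings
first_title = rows.xpath("./p/a/text()").extract_first()        # scalar or None

# Relative vs. absolute: a leading "//" ignores the node you call it from
# and searches the whole document again.
one_row = rows[0]
links_in_row = one_row.xpath(".//a/@href").extract()            # links inside this row only
links_in_doc = one_row.xpath("//a/@href").extract()             # every link in the document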
New edit
I was a bit confused about what "the content of each link" meant. Now I understand that @student also wants to follow each link and extract the AD content. Here is a solution.
- Send a request and attach its parser
You may have noticed that I use the Scrapy Request class to follow the next page. Actually, the Request class can do much more than that: you can attach the desired parse function to each request by setting its callback parameter.
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
In step 3, I did not set a callback when sending the next-page requests, because those requests should be handled by the default parse function. Now we come to the specific AD page, which is a different kind of page from the AD list page. So we need to define a new page parser function, say parse_ad, and attach this parse_ad function to each AD page request when we send it.
Let's move on to the revised sample code that works for me:
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()


class AdItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
The spider
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapydemo.items import ScrapydemoItem
from scrapydemo.items import AdItem

try:
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin


class MySpider(Spider):
    name = "demo"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        # locate the list entry of each item
        s_links = response.xpath("//*[@id='sortable-results']/ul/li")
        # locate the next page link and the current item range
        next_page = response.xpath(
            '//a[@title="next page"]/@href').extract_first()
        to = response.xpath(
            '//span[@class="rangeTo"]/text()').extract_first()
        total = response.xpath(
            '//span[@class="totalcount"]/text()').extract_first()
        # test whether we reached the end of the listing
        if next_page and int(to) < int(total):
            next_page = urljoin(response.url, next_page)
            # important: send the request for the next page;
            # the default parsing function is 'parse'
            yield Request(next_page)

        for s_link in s_links:
            # locate and extract
            title = s_link.xpath("./p/a/text()").extract_first()
            link = s_link.xpath("./p/a/@href").extract_first()
            if title is None or link is None:
                print('Warning: no title or link found: %s' % response.url)
            else:
                title = title.strip()
                link = urljoin(response.url, link)
                yield ScrapydemoItem(title=title, link=link)
                # important: send the request for the ad page;
                # its parsing function is 'parse_ad'
                yield Request(link, callback=self.parse_ad)

    def parse_ad(self, response):
        ad_title = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        ad_description = ''.join(response.xpath(
            '//section[@id="postingbody"]//text()').extract())
        if ad_title is not None and ad_description:
            yield AdItem(title=ad_title.strip(), description=ad_description)
        else:
            print('Warning: no title or description found %s' % response.url)
Key points
- Two parse functions: parse handles requests for the AD list pages, and parse_ad handles requests for a specific AD page.
- To extract the content of the AD post, you need some tricks. See How can I get all the plain text from a website with Scrapy; a small sketch below shows one way to clean up the extracted text.
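As a rough illustration of the sort of trick meant here (not part of the original answer), one way to turn the posting body into clean plain text is to join all of its text nodes and collapse the whitespace:

# Sketch: join every text node under the posting body and normalize whitespace.
def extract_plain_text(response):
    pieces = response.xpath('//section[@id="postingbody"]//text()').extract()
    # Drop empty fragments and collapse runs of whitespace.
    return ' '.join(fragment.strip() for fragment in pieces if fragment.strip())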
Snapshot of the output:
2016-11-10 21:25:14 [scrapy] DEBUG: Scraped from <200 http://sfbay.craigslist.org/eby/npo/5869108363.html>
{'description': '\n'
' \n'
' QR Code Link to This Post\n'
' \n'
' \n'
'Agency History:\n' ........
'title': 'Staff Accountant'}
2016-11-10 21:25:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 39259,
'downloader/request_count': 117,
'downloader/request_method_count/GET': 117,
'downloader/response_bytes': 711320,
'downloader/response_count': 117,
'downloader/response_status_count/200': 117,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2016, 11, 11, 2, 25, 14, 878628),
'item_scraped_count': 314,
'log_count/DEBUG': 432,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 117,
'scheduler/dequeued': 116,
'scheduler/dequeued/memory': 116,
'scheduler/enqueued': 203,
'scheduler/enqueued/memory': 203,
'start_time': datetime.datetime(2016, 11, 11, 2, 24, 59, 242456)}
2016-11-10 21:25:14 [scrapy] INFO: Spider closed (shutdown)
Thanks. Hope this helps, and have fun.
To build a basic scrapy project you can use the command:
scrapy startproject craig
Then add the spider and the items:
craig/spiders/spider.py
from scrapy import Spider
from craig.items import CraigslistSampleItem
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy import Request
import urlparse, re


class CraigSpider(Spider):
    name = "craig"
    start_url = "https://sfbay.craigslist.org/search/npo"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse_results_page)

    def parse_results_page(self, response):
        sel = Selector(response)

        # Browse paging.
        page_urls = sel.xpath(""".//span[@class='buttons']/a[@class='button next']/@href""").getall()

        for page_url in page_urls + [response.url]:
            page_url = urlparse.urljoin(self.start_url, page_url)

            # Yield a request for the next page of the list, with callback to this same function: self.parse_results_page().
            yield Request(page_url, callback=self.parse_results_page)

        # Browse items.
        item_urls = sel.xpath(""".//*[@id='sortable-results']//li//a/@href""").getall()

        for item_url in item_urls:
            item_url = urlparse.urljoin(self.start_url, item_url)

            # Yield a request for each item page, with callback self.parse_item().
            yield Request(item_url, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)

        item = CraigslistSampleItem()

        item['title'] = sel.xpath('//*[@id="titletextonly"]').extract_first()
        item['body'] = sel.xpath('//*[@id="postingbody"]').extract_first()
        item['link'] = response.url

        yield item
craig/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items

from scrapy.item import Item, Field


class CraigslistSampleItem(Item):
    title = Field()
    body = Field()
    link = Field()
craig/settings.py
# -*- coding: utf-8 -*-

BOT_NAME = 'craig'

SPIDER_MODULES = ['craig.spiders']
NEWSPIDER_MODULE = 'craig.spiders'

ITEM_PIPELINES = {
    'craig.pipelines.CraigPipeline': 300,
}
craig/pipelines.py
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exporters import CsvItemExporter


class CraigPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
You can run the spider by running the command:
scrapy runspider craig/spiders/spider.py
from the root directory of the project.
It should create a craig_ads.csv file in the root directory of the project.