How to correctly loop links with Scrapy?

I am using Scrapy and I am having some trouble looping over links.

I am scraping most of the information from one page, except for the abstracts, which live on a separate page.

Each page lists 10 articles. For each article I have to fetch the abstract from its second page; the article-to-abstract relationship is 1:1.

Here is the div section I am scraping:

<div class="articleEntry">
    <div class="tocArticleEntry include-metrics-panel toc-article-tools">
        <div class="item-checkbox-container" role="checkbox" aria-checked="false" aria-labelledby="article-d401999e88">
            <label tabindex="0" class="checkbox--primary"><input type="checkbox"
                    name="10.1080/03066150.2021.1956473"><span class="box-btn"></span></label></div><span
            class="article-type">Article</span>
        <div class="art_title linkable"><a class="ref nowrap" href="/doi/full/10.1080/03066150.2021.1956473"><span
                    class="hlFld-Title" id="article-d401999e88">Climate change and agrarian struggles: an invitation to
                    contribute to a <i>JPS</i> Forum</span></a></div>
        <div class="tocentryright">
            <div class="tocAuthors afterTitle">
                <div class="articleEntryAuthor all"><span class="articleEntryAuthorsLinks"><span><a
                                href="/author/Borras+Jr.%2C+Saturnino+M">Saturnino M. Borras Jr.</a></span>, <span><a
                                href="/author/Scoones%2C+Ian">Ian Scoones</a></span>, <span><a
                                href="/author/Baviskar%2C+Amita">Amita Baviskar</a></span>, <span><a
                                href="/author/Edelman%2C+Marc">Marc Edelman</a></span>, <span><a
                                href="/author/Peluso%2C+Nancy+Lee">Nancy Lee Peluso</a></span> &amp; <span><a
                                href="/author/Wolford%2C+Wendy">Wendy Wolford</a></span></span></div>
            </div>
            <div class="tocPageRange maintextleft">Pages: 1-28</div>
            <div class="tocEPubDate"><span class="maintextleft"><strong>Published online:</strong><span class="date"> 06
                        Aug 2021</span></span></div>
        </div>
        <div class="sfxLinkButton"></div>
        <div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
                class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
                class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
                class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
                href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
                href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
                href="/servlet/linkout?type=rightslink&amp;url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
                class="rightslink" target="_blank" title="Opens new window">Permissions</a>&nbsp;</div>
        <div class="metrics-panel">
            <ul class="altmetric-score true">
                <li><span>6049</span> Views</li>
                <li><span>0</span> CrossRef citations</li>
                <li class="value" data-doi="10.1080/03066150.2021.1956473"><span class="metrics-score">0</span>Altmetric
                </li>
            </ul>
        </div><span class="access-icon oa" role="img" aria-label="Access provided by Open Access"></span><span
            class="part-tooltip">Open Access</span>
    </div>
</div>

For this, I defined the following script:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "jps"

    start_urls = ['https://www.tandfonline.com/toc/fjps20/current']
    

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        Title = response.xpath("//span[@class='hlFld-Title']").extract()
        Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
        License = response.xpath("//span[@class='part-tooltip']").extract()
        abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
        row_data = zip(Title, Authors, License, abstract_url)
        
        for quote in row_data:
            scraped_info = {
                # key:value
                'Title': quote[0],
                'Authors': quote[1],
                'License': quote[2],
                'Abstract': quote[3]
            }
            # yield/give the scraped info to scrapy
            yield scraped_info
    
    
    def parse_links(self, response):
        
        for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
            yield scrapy.Request(links, callback=self.parse_abstract_page)
        #yield response.follow(abstract_url, callback=self.parse_abstract_page)
    
    def parse_abstract_page(self, response):
        Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
        row_data = zip(Abstract)
        for quote in row_data:
            scraped_info_abstract = {
                # key:value
                'Abstract': quote[0]
            }
            # yield/give the scraped info to scrapy
            yield scraped_info_abstract
        

The authors, titles and licenses are scraped correctly. For the abstracts, I get the following error:

ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473

To check whether the path is correct, I took abstract_url out of the loop:

 abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
 self.logger.info('get abstract page url')
 yield response.follow(abstract_url, callback=self.parse_abstract)

I can correctly retrieve the abstract for the first article, but not for the others. I think the error is in the loop.

How can I solve this?

Thanks

The link to the article abstract seems to be relative (judging from the exception): /doi/abs/10.1080/03066150.2021.1956473 does not start with https:// or http://.

You should join this relative URL with the site's base URL, i.e. if the base URL is "https://www.tandfonline.com" you can do:

import urllib.parse

link = urllib.parse.urljoin("https://www.tandfonline.com", link)

and then you will have a proper URL for the resource.
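For instance, with the abstract path from the exception (the variable names here are just for illustration):

```python
import urllib.parse

base = "https://www.tandfonline.com"
link = "/doi/abs/10.1080/03066150.2021.1956473"

# urljoin keeps the scheme and host from base and appends the relative path
absolute = urllib.parse.urljoin(base, link)
print(absolute)  # https://www.tandfonline.com/doi/abs/10.1080/03066150.2021.1956473
```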

As @tgrnie explained, the URL is a relative URL and needs to be converted into an absolute URL.

Scrapy provides a wrapper around urljoin: response.urljoin(). No extra import is needed. See the official docs here.

So this line:

yield scrapy.Request(links, callback=self.parse_abstract_page)

can be changed to:

yield scrapy.Request(response.urljoin(links), callback=self.parse_abstract_page)

Another approach is to use response.follow, as you already do in your code:

 yield response.follow(abstract_url, callback=self.parse_abstract)

If you want to follow all the links, use yield from with response.follow_all, as in this example:

yield from response.follow_all(list_of_urls, callback=self.parse_abstract)

The biggest difference between yield Request(url) and yield response.follow(url) is that relative URLs work with response.follow, while you have to supply a complete URL to create a Request object.

See the documentation here.