How to correctly loop links with Scrapy?
I am using Scrapy and I'm having some trouble looping over links.
I am scraping most of the information from a single page, except for one piece that lives on another page.
Each page lists 10 articles. For each article I have to get the abstract, which is on a second page; the correspondence between articles and abstracts is 1:1.
Here is the div section I am scraping the data from:
<div class="articleEntry">
<div class="tocArticleEntry include-metrics-panel toc-article-tools">
<div class="item-checkbox-container" role="checkbox" aria-checked="false" aria-labelledby="article-d401999e88">
<label tabindex="0" class="checkbox--primary"><input type="checkbox"
name="10.1080/03066150.2021.1956473"><span class="box-btn"></span></label></div><span
class="article-type">Article</span>
<div class="art_title linkable"><a class="ref nowrap" href="/doi/full/10.1080/03066150.2021.1956473"><span
class="hlFld-Title" id="article-d401999e88">Climate change and agrarian struggles: an invitation to
contribute to a <i>JPS</i> Forum</span></a></div>
<div class="tocentryright">
<div class="tocAuthors afterTitle">
<div class="articleEntryAuthor all"><span class="articleEntryAuthorsLinks"><span><a
href="/author/Borras+Jr.%2C+Saturnino+M">Saturnino M. Borras Jr.</a></span>, <span><a
href="/author/Scoones%2C+Ian">Ian Scoones</a></span>, <span><a
href="/author/Baviskar%2C+Amita">Amita Baviskar</a></span>, <span><a
href="/author/Edelman%2C+Marc">Marc Edelman</a></span>, <span><a
href="/author/Peluso%2C+Nancy+Lee">Nancy Lee Peluso</a></span> & <span><a
href="/author/Wolford%2C+Wendy">Wendy Wolford</a></span></span></div>
</div>
<div class="tocPageRange maintextleft">Pages: 1-28</div>
<div class="tocEPubDate"><span class="maintextleft"><strong>Published online:</strong><span class="date"> 06
Aug 2021</span></span></div>
</div>
<div class="sfxLinkButton"></div>
<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
class="rightslink" target="_blank" title="Opens new window">Permissions</a>&nbsp;</div>
<div class="metrics-panel">
<ul class="altmetric-score true">
<li><span>6049</span> Views</li>
<li><span>0</span> CrossRef citations</li>
<li class="value" data-doi="10.1080/03066150.2021.1956473"><span class="metrics-score">0</span>Altmetric
</li>
</ul>
</div><span class="access-icon oa" role="img" aria-label="Access provided by Open Access"></span><span
class="part-tooltip">Open Access</span>
</div>
</div>
To do this, I defined the following script:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "jps"
    start_urls = ['https://www.tandfonline.com/toc/fjps20/current']

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        Title = response.xpath("//span[@class='hlFld-Title']").extract()
        Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
        License = response.xpath("//span[@class='part-tooltip']").extract()
        abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
        row_data = zip(Title, Authors, License, abstract_url)
        for quote in row_data:
            scraped_info = {
                'Title': quote[0],
                'Authors': quote[1],
                'License': quote[2],
                'Abstract': quote[3]
            }
            # yield the scraped info to Scrapy
            yield scraped_info

    def parse_links(self, response):
        for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
            yield scrapy.Request(links, callback=self.parse_abstract_page)
            # yield response.follow(abstract_url, callback=self.parse_abstract_page)

    def parse_abstract_page(self, response):
        Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
        row_data = zip(Abstract)
        for quote in row_data:
            scraped_info_abstract = {
                'Abstract': quote[0]
            }
            # yield the scraped info to Scrapy
            yield scraped_info_abstract
Authors, titles and licenses are scraped correctly. For the abstract I get the following error:
ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473
To check whether the path was correct, I took abstract_url out of the loop:
abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
self.logger.info('get abstract page url')
yield response.follow(abstract_url, callback=self.parse_abstract)
This way I can correctly retrieve the abstract of the first article, but not the others. I think the error is in the loop.
How can I solve this?
Thanks
The link to the article abstract seems to be a relative link (judging from the exception): /doi/abs/10.1080/03066150.2021.1956473 does not start with https:// or http://.
You should join this relative URL with the site's base URL (i.e. if the base URL is "https://www.tandfonline.com", you can do:
import urllib.parse
link = urllib.parse.urljoin("https://www.tandfonline.com", link)
Then you will have a proper URL to the resource.
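As a quick sanity check (pure standard library, no Scrapy needed), urljoin resolves the relative /doi/abs/... path against the base URL like this:

```python
import urllib.parse

base = "https://www.tandfonline.com/toc/fjps20/current"
relative = "/doi/abs/10.1080/03066150.2021.1956473"

# a path starting with "/" replaces everything after the host
absolute = urllib.parse.urljoin(base, relative)
print(absolute)  # https://www.tandfonline.com/doi/abs/10.1080/03066150.2021.1956473
```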
As @tgrnie explained, the URL is relative and needs to be converted into an absolute URL.
Scrapy ships a wrapper around urljoin, namely response.urljoin(), so no extra import is needed. See the official docs here.
So this line:
yield scrapy.Request(links, callback=self.parse_abstract_page)
can be changed to:
yield scrapy.Request(response.urljoin(links), callback=self.parse_abstract_page)
Another option is response.follow, which you already use elsewhere in your code:
yield response.follow(abstract_url, callback=self.parse_abstract)
If you want to follow all the links, use yield from together with follow_all, as in this example:
yield from response.follow_all(list_of_urls, callback=self.parse_abstract)
The biggest difference between yield Request(url) and yield response.follow(url) is that relative URLs work with response.follow, whereas you must provide an absolute URL to create a Request object.