Scrapy Follow link
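I have the following CrawlSpider, and I can't get it to follow the links on a university website. I suspect the inconsistent markup is the cause, but I'm not sure. I tried adding a rule, but the spider still doesn't follow any links. How can I make this work?
It behaves like a single-page spider: it scrapes page 1 but doesn't follow the links.
Note: this isn't homework. I'm just experimenting after getting the hang of scraping Dmoz. Any help is appreciated.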
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from example.items import ExampleItem


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ac.uk"]
    start_urls = (
        'http://www.example.ac.uk/courses/course-finder?query=&f.Year_of_entry|E=2015/16&f.Type|D=Undergraduate',
        ''
    )

    rules = (Rule (SgmlLinkExtractor(allow=("index\.php", ), callback="parse"),))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@id="course_list"]')
        items = []
        for site in sites:
            item = ExampleItem()
            item['link'] = site.xpath('//h2/a/@href').extract()
            item['name'] = site.xpath('//h2/a/text()').extract()
            items.append(item)
        return items
The site's pagination markup looks like this:
<div class="pagination">
<ul>
<li><i class="fa fa-chevron-left"></i><span>Previous</span></li>
<li><span>Go to page</span> 1</li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=11"><span>Go to page</span> 2</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=21"><span>Go to page</span> 3</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=31"><span>Go to page</span> 4</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=41"><span>Go to page</span> 5</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=51"><span>Go to page</span> 6</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=61"><span>Go to page</span> 7</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=71"><span>Go to page</span> 8</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=81"><span>Go to page</span> 9</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=91"><span>Go to page</span> 10</a></li>
<li><a href="course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=11"><i class="fa fa-chevron-right"></i><span>Next</span></a></li>
</ul>
</div>
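The first problem, at least, is that you pass callback to the link extractor, but it should be defined at the Rule level. The callback is also given a different name here, because CrawlSpider uses parse internally to apply its rules, so overriding parse stops the rules from working: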
rules = (
    Rule(LinkExtractor(allow=(r"index\.php", )), callback="parse_result"),
)

def parse_result(self, response):
    ...
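It may also be worth checking whether the allow pattern can match the links you want at all. As far as I understand, allow patterns are searched against the absolute URL of each extracted link, and none of the pagination hrefs in the question contain index.php. The sketch below only approximates that check with the standard library; it is not Scrapy's own code, and the helper name matching_links and the alternative pattern start_rank=\d+ are my assumptions.

```python
import re
from urllib.parse import urljoin

# Start URL from the question; the pagination hrefs are relative to it.
BASE_URL = (
    "http://www.example.ac.uk/courses/course-finder"
    "?query=&f.Year_of_entry|E=2015/16&f.Type|D=Undergraduate"
)

# Two of the hrefs copied from the pagination markup in the question.
HREFS = [
    "course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=11",
    "course-finder?query=&fYear_of_entryE=2015/16&fTypeD=Undergraduate&start_rank=21",
]


def matching_links(pattern, base_url=BASE_URL, hrefs=HREFS):
    """Hypothetical helper: return the absolute URLs that `pattern` matches anywhere."""
    regex = re.compile(pattern)
    # Resolve each relative href against the page it was found on.
    absolute = [urljoin(base_url, href) for href in hrefs]
    return [url for url in absolute if regex.search(url)]


if __name__ == "__main__":
    # The pattern from the question versus one based on the pagination parameter.
    print(matching_links(r"index\.php"))
    print(matching_links(r"start_rank=\d+"))
```

If the first call returns nothing for the real page as well, the rule has no links to follow, whatever callback it is given.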
In addition, you need a separate rule to follow the pagination:
rules = (
    Rule(LinkExtractor(allow=(r"index\.php", )), callback="parse_result"),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]'), follow=True),
)