Scrapy CrawlSpider iterating through entire site
I have a simple CrawlSpider that scrapes the front page of a particular site. I would like the spider to keep going through ?p=1, ?p=2, and so on until it detects that it has reached the end of the site's pagination. How can I do that?
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['some.at']
    start_urls = [
        'https://www.some.at',
    ]

    rules = (
        Rule(LinkExtractor(allow='traueranzeigen'), callback='parse_obi'),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
Your spider only scrapes the first page because you did not add follow=True to your Rule definition, so the spider never follows the extracted links to find more of them. You also need a second Rule that follows the next-page links; those can be targeted with the restrict_css argument of LinkExtractor, using the class of the pagination div. See the example code below.
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['bestattung-aichinger.at']
    start_urls = [
        'https://www.bestattung-aichinger.at',
    ]

    rules = (
        # Follow links whose text matches 'Traueranzeigen' (the obituaries section)
        Rule(LinkExtractor(restrict_text='Traueranzeigen'), callback='parse_obi', follow=True),
        # Follow the pagination links inside the ".seitenzahlen" navigation div
        Rule(LinkExtractor(restrict_css=".seitenzahlen"), callback='parse_obi', follow=True),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
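If you want to try the spider outside a full Scrapy project, one option is to drive it with CrawlerProcess and export the yielded items via the FEEDS setting. This is just a minimal sketch with illustrative settings and an assumed output file name; running scrapy crawl crawlobituaries -o obituaries.json from a regular project works just as well.

from scrapy.crawler import CrawlerProcess

# Minimal run sketch: export every item yielded by parse_obi to obituaries.json.
# The file name and log level here are illustrative, not part of the answer above.
process = CrawlerProcess(settings={
    'FEEDS': {'obituaries.json': {'format': 'json'}},
    'LOG_LEVEL': 'INFO',
})
process.crawl(PomosCrawlSpider)
process.start()  # blocks until the crawl finishes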