using xpath go to next page with scrapy
I created a spider to scrape data from a website. It worked fine until I turned it into a CrawlSpider with a rule so it would continue to the next page. I guess my XPath in the Rule is wrong. Can you help me fix it? PS: I'm using Python 3.
Here is my spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from task11.items import Digi


class tutorial(CrawlSpider):
    name = "task11"
    allowed_domains = ["meetings.intherooms.com"]
    start_urls = ["https://meetings.intherooms.com/meetings/aa/al"]

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='(//a[@class="prevNext" and contains(text(),"Next")])[1]'),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        # Skip the header row, then collect the cell text of each meeting row.
        sites = response.xpath('//*[@class="all-meetings"]/tr')
        items = []
        for site in sites[1:]:
            item = Digi()
            item['meeting_title'] = site.xpath('td/text()').extract()
            items.append(item)
        return items
This is the expected output I get after scraping the first page (and I'd like to get more of it from the following pages):
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
'SELMA, ',
'TUESDAY',
'7:00 PM',
'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
'SELMA, ',
'THURSDAY',
'7:00 PM',
'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['Alabama Avenue & Lauderdale Street',
'SELMA, ',
'SUNDAY',
'7:00 PM',
'Alcoholics Anonymous']}
2018-08-30 08:59:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://meetings.intherooms.com/meetings/aa/al>
{'meeting_title': ['210 Lauderdale Street',
'SELMA, 36703',
'MONDAY',
'6:00 PM',
'Alcoholics Anonymous']}
I would use the class of the "Next" button:
response.xpath('//a[@class="prevNext"]/@href')
That yields 2 results: one for the navigation at the top of the list and one at the bottom.
However, when you open the next page (page 2), the previous page also gets a link with the class prevNext.
That is not a big problem, since Scrapy filters out most of the extra (duplicate) requests.
But you can restrict the links with a text filter:
response.xpath('//a[contains(text(),"Next")]/@href')
Or, if you suspect "Next" also appears in other links, you can combine both conditions:
response.xpath('//a[@class="prevNext" and contains(text(),"Next")]/@href')
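To make the difference between those two expressions concrete, here is a minimal sketch against a made-up page-2 snippet (the HTML below is an assumption about the site's structure, not a capture of it), using lxml, the library that backs Scrapy's selectors:

```python
from lxml import html

# Hypothetical page-2 navigation: Prev and Next links at the top and the
# bottom of the listing, all sharing the prevNext class.
doc = html.fromstring("""
<div>
  <a class="prevNext" href="/meetings/aa/al">Prev</a>
  <a class="prevNext" href="/meetings/aa/al?page=3">Next</a>
  <table class="all-meetings"></table>
  <a class="prevNext" href="/meetings/aa/al">Prev</a>
  <a class="prevNext" href="/meetings/aa/al?page=3">Next</a>
</div>
""")

# Filtering by class alone matches all four navigation links:
by_class = doc.xpath('//a[@class="prevNext"]/@href')
print(len(by_class))  # 4

# Adding the text filter narrows it to the two "Next" links:
by_text = doc.xpath('//a[@class="prevNext" and contains(text(),"Next")]/@href')
print(len(by_text))   # 2
```

On page 1 the difference would not matter, but on every later page the text filter keeps the crawl from re-requesting the previous page.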
For restrict_xpaths you need the expression to select the link node itself
(not the link's text or href):
restrict_xpaths='(//a[@class="prevNext" and contains(text(),"Next")])[1]'
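As a quick sanity check that this expression really selects a single a element (again against an assumed HTML snippet, with lxml standing in for Scrapy's selector):

```python
from lxml import html

# Hypothetical snippet with two matching "Next" links (top and bottom).
doc = html.fromstring("""
<div>
  <a class="prevNext" href="/meetings/aa/al">Prev</a>
  <a class="prevNext" href="/meetings/aa/al?page=2">Next</a>
  <a class="prevNext" href="/meetings/aa/al?page=2">Next</a>
</div>
""")

# The parenthesised expression selects the <a> elements, and [1] keeps
# only the first match -- this node is what restrict_xpaths receives.
nodes = doc.xpath('(//a[@class="prevNext" and contains(text(),"Next")])[1]')
print(len(nodes), nodes[0].tag)  # 1 a
```

Because LinkExtractor pulls the href out of whatever nodes the XPath returns, handing it an element node (rather than a string from /@href or /text()) is what makes the rule work.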