Having problems with a scrapy-splash script. I only get one result and my scraper does not parse other pages
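I'm trying to parse a list from a JavaScript-rendered website. When I run the spider, it returns only one entry for each column and then closes. I have already configured my middleware settings. I'm not sure what is going wrong. Thanks in advance!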
import scrapy
from scrapy_splash import SplashRequest


class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name': russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source': russian.xpath('//*[@class="column-2"]/text()').extract_first()}

        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""
        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
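The .extract_first() method you are using (now called .get()) always returns only the first result. It is not an iterator, so calling it repeatedly will not step through the remaining matches. Your loop also runs over the matched table element rather than over its rows, which is why you get a single item. Try the .getall() method instead. That would look something like this: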
# The leading dot keeps each query relative to the table instead of searching the whole page
names = response.xpath('//table[@id="tablepress-8"]').xpath('.//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('.//*[@class="column-2"]/text()').getall()
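Once names and sources are two parallel lists, they still have to be combined into one item per row. A minimal sketch of that step follows, assuming both columns contain the same number of cells. The helper name pair_columns and the sample values are made up for illustration and are not part of the original spider.

```python
def pair_columns(names, sources):
    """Yield one item per table row from two parallel column lists."""
    # zip() stops at the shorter list, so if a cell is missing
    # the trailing rows are silently dropped.
    for name, source in zip(names, sources):
        yield {'name': name.strip(), 'source': source.strip()}


# Hypothetical sample values standing in for the .getall() results
names = ['Name A ', 'Name B']
sources = ['Source A', ' Source B']
items = list(pair_columns(names, sources))
```

Inside parse, the same helper could be used as `yield from pair_columns(names, sources)`. If the columns can have empty cells, iterating over the table rows and querying each row with a relative XPath is probably the safer design, because it keeps each name aligned with its own source.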