Scraping JavaScript content using Scrapy and Splash
I am scraping this link (job search) with Scrapy and Splash, but I am unable to extract the data.
My code:
import scrapy
from scrapy_splash import SplashRequest


class ManuPySpider(scrapy.Spider):
    name = 'manulife'

    def start_requests(self):
        yield SplashRequest(
            url='https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en&location=1038',
            callback=self.parse,
        )

    def parse(self, response):
        yield {
            'demo': response.css('div.absolute > span > a::text').extract()
        }
settings.py:
BOT_NAME = 'manulife'
SPIDER_MODULES = ['manulife.spiders']
NEWSPIDER_MODULE = 'manulife.spiders'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://192.168.99.100:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
I have checked that Splash is up and running. What could be the problem?
Thanks
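When I tried rendering the page through the Splash console (on port 8050) with the default settings, the result did not contain the required data (i.e. the search results table was empty). Once I increased the wait parameter, it worked. So try increasing that parameter: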
yield SplashRequest(
    url='https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en&location=1038',
    callback=self.parse,
    args={'wait': 5},
)
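
To find a wait value that works before touching the spider, it can help to ask Splash for the rendered page directly. The sketch below only builds the URL for Splash's render.html endpoint with the same wait argument; the helper name splash_render_url is made up for illustration, and SPLASH_URL is assumed to match the value in settings.py.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed to match SPLASH_URL in settings.py.
SPLASH_URL = 'http://192.168.99.100:8050'
TARGET_URL = ('https://manulife.taleo.net/careersection/external_global/'
              'jobsearch.ftl?lang=en&location=1038')


def splash_render_url(splash_url, target_url, wait=5):
    """Hypothetical helper: build a render.html URL that makes Splash
    wait `wait` seconds after the page loads before returning the HTML."""
    query = urlencode({'url': target_url, 'wait': wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


if __name__ == '__main__':
    # Open the printed URL in a browser (or fetch it with curl) and check
    # whether the search results table is filled in.
    print(splash_render_url(SPLASH_URL, TARGET_URL, wait=5))
```

If the table is still empty in the returned HTML, a larger wait may be needed; the same value can then be passed as args={'wait': ...} in the SplashRequest.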