Lua scripts in scrapy
I am using scrapy 1.6 and splash 3.2. I have:
    import scrapy
    import random
    from scrapy_splash import SplashRequest
    from scrapy.utils.response import open_in_browser
    from scrapy.linkextractors import LinkExtractor

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0'

    class MySpider(scrapy.Spider):
        start_urls = ["http://yahoo.com"]
        name = 'mytest'

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(
                    url, self.parse, endpoint='render.html',
                    args={'wait': 2.5},
                    headers={'User-Agent': USER_AGENT, 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
                )

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # from scrapy.http.response.html import HtmlResponse
            # ht = HtmlResponse('jj')
            # ht.body.replace =response
            open_in_browser(response)
            return None
I am reading through https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash, where they give the following example:
    function main(splash)
      assert(splash:go(splash.args.url))
      splash:wait(0.5)
      local title = splash:evaljs("document.title")
      return {title=title}
    end
Obviously I can't put Lua directly into my Python script. Where should I put it, and how do I access it so that I can pass it to my initial request?
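You can pass the Lua script as a string in the lua_source argument. Note that Splash only runs lua_source on the execute endpoint (render.html ignores it), so the request should use endpoint='execute':

    script = """
    function main(splash)
      assert(splash:go(splash.args.url))
      splash:wait(0.5)
      local title = splash:evaljs('document.title')
      return {title=title}
    end
    """

    # inside start_requests(); the script does its own waiting,
    # so no separate 'wait' argument is needed
    yield SplashRequest(
        url, self.parse, endpoint='execute',
        args={'lua_source': script},
        headers={'User-Agent': USER_AGENT, 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
    )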
See the scrapy-splash documentation: https://github.com/scrapy-plugins/scrapy-splash
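
If you would rather not keep the Lua inline, one option is to store it in its own file and read it when the request is built. The following is only a minimal sketch: the helper names load_lua_script and build_splash_kwargs and the file name title.lua are made up for illustration and are not part of scrapy or scrapy-splash. It assumes the script is run through Splash's execute endpoint.

```python
from pathlib import Path


def load_lua_script(path):
    """Read a Lua script from disk so it can be sent as 'lua_source'.

    Hypothetical helper, not part of scrapy-splash.
    """
    return Path(path).read_text(encoding="utf-8")


def build_splash_kwargs(script, user_agent=None):
    """Build keyword arguments for a SplashRequest that runs `script`.

    Hypothetical helper. It targets the 'execute' endpoint, which is
    the Splash endpoint that runs Lua scripts.
    """
    kwargs = {
        "endpoint": "execute",
        "args": {"lua_source": script},
    }
    if user_agent:
        # Extra keyword arguments such as headers are passed on to
        # the underlying scrapy Request.
        kwargs["headers"] = {"User-Agent": user_agent}
    return kwargs
```

In the spider this could be used as `yield SplashRequest(url, self.parse, **build_splash_kwargs(load_lua_script("title.lua"), USER_AGENT))`. Because the Lua main function returns a table, the callback should receive a JSON response, and the title should then be available as `response.data["title"]` rather than as rendered HTML.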