How to get the HTML of a dynamic page using Scrapy and Splash?
I want to scrape the following website:
https://dimsum.eu-gb.containers.appdomain.cloud/
However, the page source is just a script:
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1"><link href="https://fonts.googleapis.com/css?family=IBM+Plex+Sans" rel="stylesheet"><link rel="icon" type="image/png" href="/favicon.png"><title>IBM Science Summarizer</title><style>#teconsent {
bottom: 120px !important;
}</style><link href="/css/article.91dc9a3f.css" rel="prefetch"><link href="/css/faq.415c1d74.css" rel="prefetch"><link href="/css/search.4bc6e428.css" rel="prefetch"><link href="/js/article.8fdbbb61.js" rel="prefetch"><link href="/js/faq.6fba764e.js" rel="prefetch"><link href="/js/search.cdc7df37.js" rel="prefetch"><link href="/css/app.de6343fa.css" rel="preload" as="style"><link href="/css/chunk-vendors.9096ae02.css" rel="preload" as="style"><link href="/js/app.d95ff0b2.js" rel="preload" as="script"><link href="/js/chunk-vendors.29fc9656.js" rel="preload" as="script"><link href="/css/chunk-vendors.9096ae02.css" rel="stylesheet"><link href="/css/app.de6343fa.css" rel="stylesheet"></head><body><noscript><strong>We're sorry but Scholar doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript><div id="app"></div><script>// window.webpackHotUpdate is present in local development mode
if (!window.webpackHotUpdate) {
var head = document.getElementsByTagName('head')[0]
var script = document.createElement('script')
script.type = 'text/javascript'
script.src = 'https://www.ibm.com/common/stats/ida_stats.js'
head.appendChild(script)
}</script><script src="/js/chunk-vendors.29fc9656.js"></script><script src="/js/app.d95ff0b2.js"></script></body></html>
First I want to search through the form on the site, but Scrapy cannot find the form. So I used scrapy-splash, but it still cannot find any form:
import scrapy
from scrapy_splash import SplashRequest


class IBMSSSpider(scrapy.Spider):
    """A spider to collect articles from the IBM SS website."""
    name = 'ibmss'
    start_urls = [
        'https://dimsum.eu-gb.containers.appdomain.cloud/'  # search?query=reading%20comprehension',
        # 'http://google.com'
    ]

    def start_requests(self):
        print("start_urls:", self.start_urls)
        for url in self.start_urls:
            yield SplashRequest(
                url, self.parse,
                args={
                    # optional; parameters passed to Splash HTTP API
                    'wait': 0.5,
                    'url': url,
                    # 'http_method' is set to 'POST' for POST requests
                    # 'body' is set to request body for POST requests
                },
                # endpoint='render.json',  # optional; default is render.html
                # splash_url='<url>',  # optional; overrides SPLASH_URL
                # slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
            )

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'__BVID__377': 'reading comprehension'},
            formxpath='//*[@id="app"]/div[2]/div[1]/div/div[2]/div/div[1]/form',
            callback=self.parse_results
        )

    def parse_results(self, response):
        RESULT_LIST = '//*[@id="app"]/div[2]/div[2]/div/div/div/div'
        # RESULT_LIST = '//*div[contains(@class, "search-results")]//div[contains(@class, "result")]'
        result_listing = response.xpath(RESULT_LIST)
        pub_item = PaperItem(pub_type='archive')
        for result in result_listing:
            pub_url = response.urljoin(result.xpath('.//div[contains(@class, "result-title")]/a/@href').extract_first())
            print(pub_url)
            yield scrapy.Request(pub_url, callback=self.parse_paper_details,
                                 meta={'result': pub_item})
Given that this is what the site's source looks like, how should I go about scraping it?
So, as a disclaimer: I was not able to get this working with Scrapy.
Scraping dynamic content in Scrapy
I'm not sure exactly what information you need from the articles, but here are some things to consider when scraping a website driven by dynamic content.
- How much of the website is driven by JavaScript?
- Is there an API whose HTTP requests I can re-engineer, rather than automating browser activity?
  - If so, do I need headers, parameters, and cookies to mimic that request?
- Pre-rendering pages with Splash
- Selenium with Scrapy, as a last resort
- Using the selenium module directly in a script
The reason for this order is that each successive option makes the scraper more likely to be brittle, and increasingly slow and inefficient.
The most efficient solution is to look for an API.
This website
Inspecting the website, you can see it is entirely driven by JavaScript, which increases the likelihood that it makes AJAX requests to an API endpoint. Using Chrome's dev tools, you can see the requests made to the API at https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search.
I often use the requests package to fiddle around with an API endpoint first. Doing so, I found it actually only needs the headers and your query. I assumed you want "reading comprehension" as the search, so I used that as the example.
I did a copy-as-cURL of the request found in the network tools and pasted it into curl.trillworks.com, which converts the headers etc. into a nice format.
For some reason it is absolutely necessary to pass null in the data string to this API. However, there is no null equivalent when passing a dictionary in Python, which is how you would pass parameters in Scrapy (using meta or cb_kwargs). I'd love to see someone else get this working in Scrapy; I may be missing something about passing parameters in the request.
Code example
import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    'Content-Type': 'application/json',
    'Origin': 'https://dimsum.eu-gb.containers.appdomain.cloud',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://dimsum.eu-gb.containers.appdomain.cloud/',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = '{"query":"reading comprehension","filters":{},"page":0,"size":10,"sort":null,"sessionInfo":""}'

response = requests.post('https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search', headers=headers, data=data)

articles = response.json()['searchResults']['results']
for a in articles:
    for b in a['sections']:
        title = b['title']
        print(title)
        print('----------')
        for c in b['fragments']:
            text = c['text']
            print(text)
Here we loop over each search-result article on the page; each article has sections with a title, which we loop over and print, and within each section there are fragments containing all the text on the page, which we then print as well.
Again, I don't know what you're doing with this information, so I can't comment further, but you should be able to store the text you need from this.
I must urge you to look at the JSON object yourself; if there is extra data you need, you only have to do a bit of JSON searching to get at it. If you want the link to the arXiv PDF, that's in there too.
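As a sketch of that kind of JSON digging (the key names are taken from the loops above; anything beyond them is an assumption you should verify against the actual response), the same traversal can be written defensively with `.get()` so a missing key doesn't raise:

```python
def extract_sections(response_json):
    """Flatten the search response into {'title', 'text'} records.

    Key names ('searchResults', 'results', 'sections', 'fragments',
    'title', 'text') come from the response structure used above;
    .get() keeps the helper from raising if an article lacks one.
    """
    records = []
    for article in response_json.get('searchResults', {}).get('results', []):
        for section in article.get('sections', []):
            title = section.get('title', '')
            for fragment in section.get('fragments', []):
                records.append({'title': title,
                                'text': fragment.get('text', '')})
    return records


# A minimal hand-made sample in the same shape, just to show the output:
sample = {'searchResults': {'results': [
    {'sections': [{'title': 'Abstract',
                   'fragments': [{'text': 'We study reading comprehension.'}]}]}
]}}
print(extract_sections(sample))
# → [{'title': 'Abstract', 'text': 'We study reading comprehension.'}]
```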
Update from comments
Here is a code example of what you would need to implement to get this working with Scrapy.
import scrapy
import json


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['dimsum.eu-gb.containers.appdomain.cloud/']

    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
        'Content-Type': 'application/json',
        'Origin': 'https://dimsum.eu-gb.containers.appdomain.cloud',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://dimsum.eu-gb.containers.appdomain.cloud/',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    cookies = {
        'dimsum_user': 'dce0087b-b1ed-4ceb-861a-6dcdc1af500f',
        'JSESSIONID': 'node01i38ra486o3eocapxvtryared1263001.node0',
    }
    data = {"query":"reading comprehension","filters":{},"page":0,"size":10,"sort":null,"sessionInfo":""}

    def start_requests(self):
        api_url = 'https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search'
        yield scrapy.Request(url=api_url, method='POST', headers=self.headers,
                             cb_kwargs={'data': self.data}, cookies=self.cookies,
                             callback=self.parse)

    def parse(self, response):
        articles = response.json()['searchResults']['results']
        for a in articles:
            for b in a['sections']:
                title = b['title']
                print(title)
                print('----------')
                for c in b['fragments']:
                    text = c['text']
                    print(text)
The problem
null is not a keyword in Python, so it cannot be used in a dictionary; unfortunately, "sort": null has to be part of the parameters I put into the data variable. I also tried converting it to a JSON string, without success.
The error you get is
data = {"query":"reading comprehension","filters":{},"page":0,"size":10,"sort":null,"sessionInfo":""}
NameError: name 'null' is not defined
Basic scrapy logs
2020-07-30 13:10:10 [scrapy.core.engine] INFO: Spider opened
2020-07-30 13:10:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-30 13:10:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-30 13:10:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search> (failed 1 times): 500 Internal Server Error
2020-07-30 13:10:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search> (failed 2 times): 500 Internal Server Error
2020-07-30 13:10:10 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <POST https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search> (failed 3 times): 500 Internal Server Error
2020-07-30 13:10:10 [scrapy.core.engine] DEBUG: Crawled (500) <POST https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search> (referer: https://dimsum.eu-gb.containers.appdomain.cloud/)
2020-07-30 13:10:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://dimsum.eu-gb.containers.appdomain.cloud/api/scholar/search>: HTTP status code is not handled or not allowed
2020-07-30 13:10:10 [scrapy.core.engine] INFO: Closing spider (finished)
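One possible direction, offered only as an untested sketch: Python's json.dumps serializes None as null, so the payload could be built with None and sent as the raw request body (via scrapy.Request's body argument, with the JSON Content-Type header) instead of through meta or cb_kwargs:

```python
import json

# None is Python's equivalent of JSON null; json.dumps does the mapping.
payload = {"query": "reading comprehension", "filters": {},
           "page": 0, "size": 10, "sort": None, "sessionInfo": ""}

body = json.dumps(payload)
print(body)  # contains "sort": null — the literal the API appears to require

# In a spider, this string would go into the request body, e.g.:
# yield scrapy.Request(api_url, method='POST', body=body,
#                      headers={'Content-Type': 'application/json'},
#                      callback=self.parse)
```

Whether the API then accepts the request I can't say without running it against the live endpoint, but this at least sidesteps the NameError.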
I'm open to ideas and suggestions on this problem.