Scraping JavaScript-loaded CDN data with Python
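I found a large dataset, here, full of neatly stored tabular data that I would like to parse and save locally.
The problem is that no matter how deep I dig into the page source, there is no actual data in it, nor any discernible source page.
My question is whether the data can be accessed with the typical requests.get() and .content approach, or whether something like selenium would work. If neither of those is an option, what is?
Thanks in advance.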
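One quick way to test the first half of the question is to count the table cells in whatever requests.get() returns. The sketch below uses only the standard library. The two sample HTML strings and the names CellCounter and count_table_cells are illustrative assumptions, not taken from the page in question.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts <td> and <th> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cells += 1


def count_table_cells(html):
    """Return the number of table cells found in the given HTML text."""
    counter = CellCounter()
    counter.feed(html)
    counter.close()
    return counter.cells


# Hypothetical page shell: the table would be filled in later by a script.
shell = '<html><body><div id="app"></div><script src="embed.js"></script></body></html>'
# Hypothetical rendered fragment that does contain rows.
rendered = "<table><tr><th>Date</th></tr><tr><td>04/05/2020</td></tr></table>"

print(count_table_cells(shell))
print(count_table_cells(rendered))
```

If the count is zero for response.text, the rows are arriving through a later request, and the browser's network tab is the place to look for it.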
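See my comment, for what it's worth. Here is the request that should work but doesn't, for reasons I'm not sure of, unless they have some security around the cookies.
Inspecting the page, it is making a POST request to c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42. What comes back is in a structured format too; you can see it in the Preview tab of the request in the browser's network tools. Interestingly, 'responseText' gives you the data, all in HTML. So in theory you could parse just that part to get what you need. The problem is that when I recreate this HTTP request, the server says the AppKey, which the request needs as a cookie, is wrong.
So selenium would work; I'm not sure how much I can do about the AppKey.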
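To make the "parse just that part" idea concrete, here is a minimal sketch that turns an HTML table into rows and then into CSV text, using only the standard library. The sample markup is made up (the real 'responseText' layout is not shown in this post), and TableRows, parse_rows and rows_to_csv are hypothetical helper names.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return the table rows in the HTML text as a list of lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows to CSV text that can be written to a local file."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up stand-in for the HTML carried in 'responseText'.
sample = (
    "<table><tr><th>Date</th><th>Property</th><th>Zone</th></tr>"
    "<tr><td>04/05/2020</td><td>Atterbury</td><td>Central</td></tr></table>"
)
rows = parse_rows(sample)
print(rows_to_csv(rows))
```

The same parse_rows call would apply to driver.page_source if the page is loaded through selenium instead.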
import requests

# Cookies copied from the browser session.
cookies = {
    'cbParamList': '',
    'cbCookieAccepted': '1',
    'AppKey': '311a1000697d9171cc1c4128ae42',
    'AWSALB': '76fnReAlqLZyJz4gNmSMnGc3oluXMlbsrGwaF+kcm4Rg8fklrjjrxvmez+XxXXg/yDle490fw/MKBNPWCyoGAiihFYgcWQ1RSp0vxSGJHDnfXncHSQuprTjv8Fjk',
    'AWSALBCORS': '76fnReAlqLZyJz4gNmSMnGc3oluXMlbsrGwaF+kcm4Rg8fklrjjrxvmez+XxXXg/yDle490fw/MKBNPWCyoGAiihFYgcWQ1RSp0vxSGJHDnfXncHSQuprTjv8Fjk',
}

# Request headers copied from the browser's network tools.
headers = {
    'authority': 'c0cre127.caspio.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'content-type': 'multipart/form-data; boundary=----WebKitFormBoundarykaIBnhjgBEZ0L714',
    'accept': '*/*',
    'origin': 'https://c0cre127.caspio.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42',
    'accept-language': 'en-US,en;q=0.9',
}

params = (
    ('rnd', '1596940878792'),
)

# Multipart boundary declared in the content-type header above.
boundary = '----WebKitFormBoundarykaIBnhjgBEZ0L714'

# Form fields in the order they appear in the captured request body.
fields = [
    ('cbUniqueFormId', '_69831fa53c178f'),
    ('ComparisonType1_1', ''), ('MatchNull1_1', 'N'),
    ('FieldName2', 'Date'), ('Operator2', 'OR'), ('NumCriteriaDetails2', '1'),
    ('ComparisonType2_1', '='), ('MatchNull2_1', 'N'),
    ('FieldName3', ''), ('Operator3', ''), ('NumCriteriaDetails3', '1'),
    ('ComparisonType3_1', ''), ('MatchNull3_1', 'N'),
    ('FieldName4', 'Property'), ('Operator4', 'OR'), ('NumCriteriaDetails4', '1'),
    ('ComparisonType4_1', '='), ('MatchNull4_1', 'N'),
    ('FieldName5', ''), ('Operator5', ''), ('NumCriteriaDetails5', '1'),
    ('ComparisonType5_1', ''), ('MatchNull5_1', 'N'),
    ('FieldName6', 'Zone'), ('Operator6', 'OR'), ('NumCriteriaDetails6', '1'),
    ('ComparisonType6_1', '='), ('MatchNull6_1', 'N'),
    ('AppKey', '311a1000697d9171cc1c4128ae42'),
    ('PrevPageID', '1'), ('cbPageType', 'Search'), ('PageID', '2'),
    ('GlobalOperator', 'AND'), ('NumCriteria', '6'), ('Search', '1'),
    ('Value2_1', '04/05/2020'), ('Value4_1', 'Atterbury'), ('Value6_1', 'Central'),
    ('ClientQueryString', ''), ('AjaxAction', 'SearchForm'), ('GridMode', 'False'),
    ('cbUniqueFormId', '_69831fa53c178f'),
    ('AjaxActionHostName', 'https://c0cre127.caspio.com'),
    ('cbAjaxReferrer', 'https://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42'),
]

# Rebuild the raw body exactly as it was captured, including its leading '$'.
data = '$' + ''.join(
    f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'
    for name, value in fields
) + f'--{boundary}--\r\n'

response = requests.post('https://c0cre127.caspio.com/dp/311a1000697d9171cc1c4128ae42', headers=headers, params=params, cookies=cookies, data=data)
Output
'Undefined AppKey. (<a href="http://www.caspio.com/l/default.ashx?s=157">Caspio Bridge</a> error) (60011)'
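A hand-copied multipart body is easy to get subtly wrong, so it may be worth generating the body from a list of fields and comparing it with what the browser sent. The following is a minimal sketch under that assumption, using only the standard library; encode_multipart and the 'demo-boundary' value are hypothetical, and nothing here is confirmed to resolve the AppKey error.

```python
import uuid


def encode_multipart(fields, boundary=None):
    """Return (body_bytes, content_type) for a list of (name, value) text fields."""
    # A random boundary is used unless the caller supplies one.
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n'
            f'\r\n'
            f'{value}\r\n'
        )
    # The closing delimiter carries two extra trailing dashes.
    parts.append(f'--{boundary}--\r\n')
    body = ''.join(parts).encode('utf-8')
    return body, f'multipart/form-data; boundary={boundary}'


# A few of the fields from the request above, enough to show the layout.
fields = [
    ('AppKey', '311a1000697d9171cc1c4128ae42'),
    ('Value2_1', '04/05/2020'),
    ('ClientQueryString', ''),
]
body, content_type = encode_multipart(fields, boundary='demo-boundary')
print(content_type)
print(body.decode('utf-8'))
```

The returned content_type has to be sent as the content-type header so that the boundary in the header matches the one in the body. requests can also build such a body itself when each field is passed through files= as a (None, value) tuple; in that case the content-type header should not be set by hand.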