Scraping a website that paginates via the __doPostBack method with hidden URLs
I am new to Scrapy. I am trying to scrape this website, built in ASP.NET, which contains various profiles. It has 259 pages in total. To navigate between pages there are several links at the bottom, like 1, 2, 3... and so on. These links use __doPostBack:
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn','')"
For each page, only the bold text changes. How can I iterate through multiple pages with Scrapy and extract the information? The form data is as follows:
__EVENTTARGET: ctl00%24ContentPlaceHolder1%24RepeaterPaging%24ctl01%24Pagingbtn
__EVENTARGUMENT:
__VIEWSTATE: %2FwEPDwUKMTk1MjIxNTU1Mw8WAh4HdG90cGFnZQKDAhYCZg9kFgICAw9kFgICAQ9kFgoCAQ8WAh4LXyFJdGVtQ291bnQCFBYoZg9kFgJmDxUFCDY0MzMuanBnCzggR2VtcyBMdGQuCzggR2VtcyBMdGQuBDY0MzMKOTgyOTEwODA3MGQCAQ9kFgJmDxUFCDMzNTkuanBnCDkgSmV3ZWxzCDkgSmV3ZWxzBDMzNTkKOTg4NzAwNzg4OGQCAg9kFgJmDxUFCDc4NTEuanBnD0EgLSBTcXVhcmUgR2Vtcw9BIC0gU3F1YXJlIEdlbXMENzg1MQo5OTI5NjA3ODY4ZAIDD2QWAmYPFQUIMTg3My5qcGcLQSAmIEEgSW1wZXgLQSAmIEEgSW1wZXgEMTg3Mwo5MzE0Njk1ODc0ZAIED2QWAmYPFQUINzc5Ni5qcGcTQSAmIE0gR2VtcyAmIEpld2VscxNBICYgTSBHZW1zICYgSmV3ZWxzBDc3OTYKOTkyOTk0MjE4NWQCBQ9kFgJmDxUFCDc2NjYuanBnDEEgQSBBICBJbXBleAxBIEEgQSAgSW1wZXgENzY2Ngo4MjkwNzkwNzU3ZAIGD2QWAmYPFQUINjM2OC5qcGcaQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24aQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24ENjM2OAo5ODI5MDU2MzM0ZAIHD2QWAmYPFQUINjM2OS5qcGcPQSBBIEEgJ3MgSmV3ZWxzD0EgQSBBICdzIEpld2VscwQ2MzY5Cjk4MjkwNTYzMzRkAggPZBYCZg8VBQg3OTQ3LmpwZwxBIEcgIFMgSW1wZXgMQSBHICBTIEltcGV4BDc5NDcKODk0Nzg2MzExNGQCCQ9kFgJmDxUFCDc4ODkuanBnCkEgTSBCIEdlbXMKQSBNIEIgR2VtcwQ3ODg5Cjk4MjkwMTMyODJkAgoPZBYCZg8VBQgzNDI2LmpwZxBBIE0gRyAgSmV3ZWxsZXJ5EEEgTSBHICBKZXdlbGxlcnkEMzQyNgo5MzE0NTExNDQ0ZAILD2QWAmYPFQUIMTgyNS5qcGcWQSBOYXR1cmFsIEdlbXMgTi4gQXJ0cxZBIE5hdHVyYWwgR2VtcyBOLiBBcnRzBDE4MjUKOTgyODAxMTU4NWQCDA9kFgJmDxUFCDU3MjYuanBnC0EgUiBEZXNpZ25zC0EgUiBEZXNpZ25zBDU3MjYAZAIND2QWAmYPFQUINzM4OS5qcGcOQSBSYXdhdCBFeHBvcnQOQSBSYXdhdCBFeHBvcnQENzM4OQBkAg4PZBYCZg8VBQg1NDcwLmpwZxBBLiBBLiAgSmV3ZWxsZXJzEEEuIEEuICBKZXdlbGxlcnMENTQ3MAo5OTI4MTA5NDUxZAIPD2QWAmYPFQUIMTg5OS5qcGcSQS4gQS4gQS4ncyBFeHBvcnRzEkEuIEEuIEEuJ3MgRXhwb3J0cwQxODk5Cjk4MjkwNTYzMzRkAhAPZBYCZg8VBQg0MDE5LmpwZwpBLiBCLiBHZW1zCkEuIEIuIEdlbXMENDAxOQo5ODI5MDE2Njg4ZAIRD2QWAmYPFQUIMzM3OS5qcGcPQS4gQi4gSmV3ZWxsZXJzD0EuIEIuIEpld2VsbGVycwQzMzc5Cjk4MjkwMzA1MzZkAhIPZBYCZg8VBQgzMTc5LmpwZwxBLiBDLiBSYXRhbnMMQS4gQy4gUmF0YW5zBDMxNzkKOTgyOTY2NjYyNWQCEw9kFgJmDxUFCDc3NTEuanBnD0EuIEcuICYgQ29tcGFueQ9BLiBHLiAmIENvbXBhbnkENzc1MQo5ODI5MTUzMzUzZAIDDw8WAh4HRW5hYmxlZGhkZAIFDw8WAh8CaGRkAgcPPCsACQIADxYEHghEYXRhS2V5cxYAHwECCmQBFgQeD0hvcml6b250YWxBbGlnbgsqKVN5c3RlbS5XZWIuVUkuV2ViQ29udHJvbHMuSG9yaXpvbnRhbEFsaWduAh4EXyFTQgKAgAQWFGYPZBYCAgEPDxYKHg9Db21tYW5kQXJndW1lbnQFATAeBFRleHQFATEeCUJhY2tDb2xvcgoAHwJoHwUCCGRkAgEPZBYCAgEPDxYEHwYFATEfBwUBMmRkAgIPZBYCAgEPDxYEHwYFATIfBwUBM2RkAgMPZBYCAgEPDxYEHwYFATMfBwUBNGRkAgQPZBYCAgEPDxYEHwYFATQfBwUBNWRkAgUPZBYCAgEPDxYEHwYFATUfBwUBNmRkAgYPZBYCAgEPDxYEHwYFATYfBwUBN2RkAgcPZBYCAgEPDxYEHwYFATcfBwUBOGRkAggPZBYCAgEPDxYEHwYFATgfBwUBOWRkAgkPZBYCAgEPDxYEHwYFATkfBwUCMTBkZAINDw8WAh8HBQ1QYWdlIDEgb2YgMjU5ZGRkfEDzDJt%2FoSnSGPBGHlKDPRi%2Fbk0%3D
__EVENTVALIDATION: %2FwEWDALTg7oVAsGH9qQBAsGHisMBAsGHjuEPAsGHotEBAsGHpu8BAsGHupUCAsGH%2FmACwYeS0QICwYeW7wIC%2FLHNngECkI3CyQtVVahoNpNIXsQI6oDrxjKGcAokIA%3D%3D
I have looked at multiple solutions and posts suggesting that I inspect the POST call and reuse its parameters, but I could not make sense of the parameters supplied in the POST.
In short, you only need to send __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE, and __EVENTVALIDATION.
__EVENTTARGET: ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn, change the bold part to get different pages.
__EVENTARGUMENT: always empty.
__VIEWSTATE: found in an input tag whose id is __VIEWSTATE.
__EVENTVALIDATION: found in an input tag whose id is __EVENTVALIDATION.
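Since only the last two fields are dynamic, they can be pulled out of the previous response before each POST. Below is a minimal standard-library sketch; `extract_hidden_field` and `build_postback_formdata` are hypothetical helper names, and the regex assumes ASP.NET's usual attribute order (id before value). In a real spider you would use `response.xpath(...)` instead.

```python
import re

def extract_hidden_field(html, field_id):
    # Pull the value of a hidden input by id (assumes the id attribute
    # precedes value, as ASP.NET normally renders it).
    m = re.search(r'id="%s"[^>]*\bvalue="([^"]*)"' % re.escape(field_id), html)
    return m.group(1) if m else None

def build_postback_formdata(html, event_target):
    # Assemble the four fields a __doPostBack form post needs.
    return {
        "__EVENTTARGET": event_target,
        "__EVENTARGUMENT": "",
        "__VIEWSTATE": extract_hidden_field(html, "__VIEWSTATE"),
        "__EVENTVALIDATION": extract_hidden_field(html, "__EVENTVALIDATION"),
    }

html = ('<input type="hidden" id="__VIEWSTATE" value="AAA" />'
        '<input type="hidden" id="__EVENTVALIDATION" value="BBB" />')
formdata = build_postback_formdata(
    html, "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl03$Pagingbtn")
# formdata["__VIEWSTATE"] == "AAA", formdata["__EVENTVALIDATION"] == "BBB"
```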
It is worth mentioning that when you extract the names, the actual XPath may differ from the one you copy out of Chrome:
Actual xpath: //*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()
Chrome version: //*[@id="aspnetForm"]/div[3]/section/div/div/div[1]/div/h3/text()
Update: for pages after 05, update __VIEWSTATE and __EVENTVALIDATION on every request, and use "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn" as __EVENTTARGET to fetch the next page.
The 00 part of __EVENTTARGET relates to the current page, for example:
1 2 3 4 5 6 7 8 9 10
00 01 02 03 04 05 06 07 08 09
^^
To get page 7: use index 06
------------------------------
2 3 4 5 6 7 8 9 10 11
00 01 02 03 04 05 06 07 08 09
^^
To get page 8: use index 06
------------------------------
12 13 14 15 16 17 18 19 20 21
00 01 02 03 04 05 06 07 08 09
^^
To get page 18: use index 06
------------------------------
current page: ^^
The other part of __EVENTTARGET stays the same, which means the current page is encoded in __VIEWSTATE (and in __EVENTVALIDATION? Not entirely sure, but it does not matter). We can extract those values and send them back to tell the server that we are now on page 10, page 100, and so on.
To get the next page we can then use the fixed __EVENTTARGET ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn.
Likewise, ctl00$ContentPlaceHolder1$RepeaterPaging$ctl07$Pagingbtn fetches the page two ahead.
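The button-name arithmetic can be wrapped in a tiny helper. This is only a sketch: `postback_target` is a hypothetical name, and the index rule is inferred from the examples above.

```python
def postback_target(index):
    # Build the __EVENTTARGET for the paging button at a 0-based index.
    return "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl%02d$Pagingbtn" % index

# From page 1 the pager shows pages 1-10 at indices 00-09,
# so page N (N <= 10) is reachable via index N - 1:
page4_target = postback_target(3)

# Once past page 5 the current page is pinned at index 05,
# so index 06 always means "one page forward":
NEXT_PAGE = postback_target(6)
# NEXT_PAGE == "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn"
```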
Here is a demo (updated):
# SO Debug Spider
# OUTPUT: 2018-07-22 10:54:31 [SOSpider] INFO: ['Aadinath Gems & Jewels']
#         The first person of page 4 is Aadinath Gems & Jewels
#
# OUTPUT: 2018-07-23 10:52:07 [SOSpider] ERROR: ['Ajay Purohit']
#         The first person of page 12 is Ajay Purohit
import scrapy


class SOSpider(scrapy.Spider):
    name = "SOSpider"
    url = "http://www.jajaipur.com/Member_List.aspx"

    def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse_form_0_5)

    def parse_form_0_5(self, response):
        selector = scrapy.Selector(response=response)
        VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

        # It's fine to use this method from page 1 to page 5
        formdata = {
            # change pages here
            "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl03$Pagingbtn",
            "__EVENTARGUMENT": "",
            "__VIEWSTATE": VIEWSTATE,
            "__EVENTVALIDATION": EVENTVALIDATION,
        }
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse_0_5)

        # After page 5, you should try this
        # get page 6
        formdata["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl05$Pagingbtn"
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": 6})

    def parse(self, response):
        # use meta to decide when to stop
        currPage = response.meta["PAGE"]
        if currPage == 15:
            return

        # extract names here
        selector = scrapy.Selector(response=response)
        names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
        self.logger.error(names)

        # parse VIEWSTATE and EVENTVALIDATION again,
        # which encode the current page
        VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

        # get next page
        formdata = {
            # 06 is the next page, 07 is two pages ahead, ...
            "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn",
            "__EVENTARGUMENT": "",
            "__VIEWSTATE": VIEWSTATE,
            "__EVENTVALIDATION": EVENTVALIDATION,
        }
        yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": currPage + 1})

    def parse_0_5(self, response):
        selector = scrapy.Selector(response=response)
        # only extract names
        names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
        self.logger.error(names)