Mechanize and Python, clicking href="javascript:void(0);" links and getting the response back
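I need to scrape some data from a page where I fill out a form (I've already done that part with mechanize). The problem is that the site returns the results across many pages, and I can't get the data from those pages.
Getting the first results page is no problem, since it is displayed right after the search: I simply submit the form and read the response.
I analyzed the source of the results page, and it appears to use JavaScript and RichFaces (some JSF library with Ajax support, though I may be wrong, as I'm not a web expert).
However, I managed to figure out how to reach the remaining results pages. I need to click links of this form (href="javascript:void(0);", full code below):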
<td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">»</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">»»»»</a>
<script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td>
<td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&b="></script><script type="text/javascript">
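So I would like to ask whether there is a way to click all of these links and fetch all of the pages using mechanize (note that more pages become available after the » symbol)? I'm asking for answers pitched at a total dummy when it comes to web knowledge :)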
This worked for me: the HTML seems to be available in page (read from driver.page_source in the code below).
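import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')
next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'
pages = []
it = 0
while it < 1795:  # hard-coded number of "next" clicks
    time.sleep(1)  # give the Ajax update a moment to finish
    it += 1
    bad = True
    while bad:
        try:
            # find_element(By.ID, ...) replaces find_element_by_id,
            # which recent Selenium 4 releases no longer provide
            driver.find_element(By.ID, next_id).click()
            bad = False
        except Exception:
            print('retry')
    # store the HTML of the page we just moved to
    page = driver.page_source
    pages.append(page)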
Instead of first collecting and storing all the HTML, you could also query just for what you want, but you would need lxml or BeautifulSoup for that.
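As a rough illustration of querying only what you need, here is a minimal sketch that reads the pager state out of a stored page. It uses the standard library's html.parser as a dependency-free stand-in for lxml or BeautifulSoup, which would be more convenient on real pages. The class names rf-ds-act and rf-ds-btn-next come from the markup in the question; PagerParser, pager_state and the shortened ids in the sample are made up for this example. The result rows could presumably be pulled out the same way, but their markup isn't shown here.

```python
from html.parser import HTMLParser


class PagerParser(HTMLParser):
    """Collect the active page number and whether a 'next' button exists."""

    def __init__(self):
        super().__init__()
        self.current_page = None
        self.has_next = False
        self._in_active = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # The active page is rendered with the rf-ds-act class.
        if "rf-ds-act" in classes:
            self._in_active = True
        # The "next" button is an <a> carrying rf-ds-btn-next.
        if tag == "a" and "rf-ds-btn-next" in classes:
            self.has_next = True

    def handle_data(self, data):
        # The first non-empty text after the active element is its number.
        if self._in_active and data.strip():
            self.current_page = int(data.strip())
            self._in_active = False


def pager_state(html):
    """Return (current_page, has_next) for one stored page of HTML."""
    parser = PagerParser()
    parser.feed(html)
    return parser.current_page, parser.has_next


# Shortened sample modelled on the pager markup from the question.
sample = (
    '<span class="rf-ds-nmb-btn rf-ds-act " id="x_ds_1">1</span>'
    '<a class="rf-ds-nmb-btn " href="javascript:void(0);" id="x_ds_2">2</a>'
    '<a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="x_ds_next">»</a>'
)
print(pager_state(sample))
```

Each entry in pages could be passed to pager_state in the same way, for example to check that no page was skipped.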
EDIT: After running it, I did notice that we hit an error. It was simple enough to catch the exception and retry.
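The retry loop above never gives up and never pauses between attempts. One possible variation, sketched here under my own assumptions, caps the number of attempts and sleeps between them. The helper name retry, its parameters and the flaky_click stand-in are hypothetical and not part of Selenium.

```python
import time


def retry(action, attempts=5, delay=0.5, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            print("retry %d/%d: %r" % (attempt, attempts, error))
            time.sleep(delay)
    # Every attempt failed: surface the last error instead of looping forever.
    raise last_error


# Stand-in for a click that fails twice before the element is ready.
calls = {"n": 0}


def flaky_click():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"


print(retry(flaky_click, delay=0.01))
```

With a real driver, the action would be something like lambda: driver.find_element(By.ID, next_id).click().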
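First of all, I'd still stick with selenium, since this is a rather "javascript-heavy" website. Note that you can use a headless browser (PhantomJS, or a regular browser with a virtual display) if needed.
The idea here is to paginate by 100 rows per page and click the ">>" (») link until it no longer appears on the page, which means we've reached the last page and there are no more results to process. To make the solution reliable, we need to use Explicit Waits: every time we proceed to the next page, wait for the loading spinner to become invisible.
Working implementation: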
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Firefox()
driver.maximize_window()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)

# paginate by 100
select = Select(driver.find_element(By.ID, "drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")

while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    # .text is a string, so convert it before formatting with %d
    current_page = int(driver.find_element(By.CLASS_NAME, "rf-ds-act").text)
    print("Current page: %d" % current_page)

    # TODO: collect the results

    # proceed to the next page
    try:
        next_page = driver.find_element(By.LINK_TEXT, u"»")
        next_page.click()
    except NoSuchElementException:
        break
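To make the explicit wait a bit less magical, here is a rough sketch of what such a wait boils down to: poll a condition until it returns something truthy or a deadline passes. This is an illustration only, not Selenium's actual implementation. The names wait_until, WaitTimeout and spinner_invisible are invented for the example, and the spinner is simulated with a timer so that the snippet runs without a browser.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became true in time."""


def wait_until(condition, timeout=30.0, poll=0.5):
    """Poll condition() until it is truthy, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Simulated spinner that disappears 0.2 seconds from now.
spinner_gone_at = time.monotonic() + 0.2


def spinner_invisible():
    return time.monotonic() >= spinner_gone_at


print(wait_until(spinner_invisible, timeout=2, poll=0.05))
```

Unlike a fixed time.sleep(1), this returns as soon as the condition holds and fails loudly if it never does, which is why the loop above waits on the spinner instead of sleeping.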