Scraping data from a site that has no form tag but a text input using python
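I'm working on a Python program to scrape data from http://www.sleepeducation.org/find-a-facility. I've had success with scraping before, but this one is a challenge for me. I'm using Beautiful Soup and mechanize. I need to be able to enter a zip code into the text box to generate the results I want.
Here is the snippet containing the input text box: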
<div id="ContentPlaceHolder1_C001_pnlFindACenter" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ContentPlaceHolder1_C001_btnSearchClient')">
<div style="width: 400px; float: left; padding-top: 5px;">
<label for="ContentPlaceHolder1_C001_tbUserAddress" style="font-family: Arial; font-size: 13.3333px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-decoration: none; text-transform: none; color: rgb(0, 0, 0); cursor: auto; display: inline-block; position: relative; z-index: 100; margin-right: -121px; left: 2px; top: 0px; opacity: 1;">Address, City or Zip:</label><input name="ctl00$ContentPlaceHolder1$C001$tbUserAddress" type="text" id="ContentPlaceHolder1_C001_tbUserAddress" class="textInField" style="width: 240px; background-image: url(""); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" data-hasqtip="21" oldtitle="Address, City or Zip:" title="" autocomplete="off" aria-describedby="qtip-21">
<div id="divDistance" style="display: inline;">
within
<select name="ctl00$ContentPlaceHolder1$C001$ddlRadius" id="ContentPlaceHolder1_C001_ddlRadius">
<option value="5">5</option>
<option value="10">10</option>
<option selected="selected" value="25">25</option>
<option value="50">50</option>
<option value="100">100</option>
</select>
miles
</div>
</div>
<div style="width: 160px; float: left;">
<input type="submit" name="ctl00$ContentPlaceHolder1$C001$btnSearchClient" value="Search" onclick="GeocodeLocation();return false;WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$C001$btnSearchClient", "", false, "", "find-a-center", false, false))" id="ContentPlaceHolder1_C001_btnSearchClient" class="btnCenter">
</div>
<div style="clear: both;">
</div>
<div>
<span onchange="" style="font-size:12px;display: inline;" data-hasqtip="22" oldtitle="<b>AASM SleepTM</b> is an innovative telemedicine system that brings your sleep doctor to you. Featuring a secure, web-based video platform, AASM SleepTM allows you to meet with your sleep doctor from a distance. These live video visits will save you time and money. AASM SleepTM also syncs with Fitbit sleep data and has an integrated sleep diary, enabling you and your doctor to monitor your sleep." title="" aria-describedby="qtip-22"><input id="ContentPlaceHolder1_C001_chkSleepTM" type="checkbox" name="ctl00$ContentPlaceHolder1$C001$chkSleepTM"><label for="ContentPlaceHolder1_C001_chkSleepTM">Only show AASM SleepTM capable sleep centers in my state</label></span>
<a href="https://sleeptm.com/" style="font-size: 10px; margin-left: 10px; display: inline;" target="_blank" data-hasqtip="23" oldtitle="<b>AASM SleepTM</b> is an innovative telemedicine system that brings your sleep doctor to you. Featuring a secure, web-based video platform, AASM SleepTM allows you to meet with your sleep doctor from a distance. These live video visits will save you time and money. AASM SleepTM also syncs with Fitbit sleep data and has an integrated sleep diary, enabling you and your doctor to monitor your sleep." title="" aria-describedby="qtip-23">What is AASM SleepTM?</a>
</div>
</div>
These are my attempts so far:
url = 'http://www.sleepeducation.org/find-a-facility'
MILES = '100'
CODE = '33060'
Attempt 1
first = urllib2.Request(url,
data=urllib.urlencode({'value': CODE}),
headers={'User-Agent' : 'Google Chrome' 'Cookie': 'name = ctl00$ContentPlaceHolder1$C001$tbUserAddress'})
Attempt 2
post_params = {
'ctl00$ContentPlaceHolder1$C001$tbUserAddress': CODE
}
first = urllib.urlencode(post_params)
driver = webdriver.Chrome()
driver.get(url)
sbox = driver.find_element_by_class_name("ctl00$ContentPlaceHolder1$C001$tbUserAddress")
sbox.send_keys(CODE)
driver.find_element_by_class_name("ctl00$ContentPlaceHolder1$C001$btnSearchClient").click()
Attempt 3
br = mechanize.Browser()
br.open(url)
br.select_form(name='ctl00$ContentPlaceHolder1$C001$tbUserAddress')
br['value'] = CODE
br.submit()
http = urllib2.urlopen(br.response())
soup = BeautifulSoup(http, "html5lib")
Error = "no form matching name 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'"
Attempt 4
soup.find('input', {'name': 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'})['value'] = CODE
soup.find('input', {'name': 'ctl00$ContentPlaceHolder1$C001$btnSearchClient'}).click()
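If I understand your question correctly, you want to send a request with specific parameters and check the response.
OK, let's see where the request is sent after the form is submitted.
Let's open Postman. (Screenshot: POST request params)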
We can see that ctl00$ContentPlaceHolder1$C001$tbUserAddress gets the value 100, and that ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius, ctl00$ContentPlaceHolder1$C001$ddlRadius and ctl00$cphTopBar$T917BC451013$rblRadius get the radius value 25.
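To make the observation above concrete, here is a small sketch of what such a POST body looks like once it is form-encoded. It only uses the standard library; the two field names come from the page, and the variable names form_fields and body are just for illustration. Note that the keys are the inputs' name attributes (not their id), and that each $ is percent-encoded as %24.

```python
from urllib.parse import urlencode

# Field names are the "name" attributes seen in the request, not the "id"s.
form_fields = {
    "ctl00$ContentPlaceHolder1$C001$tbUserAddress": 100,
    "ctl00$ContentPlaceHolder1$C001$ddlRadius": 25,
}

# urlencode() produces the application/x-www-form-urlencoded body
# that a browser (or requests) would send for these fields.
body = urlencode(form_fields)
print(body)
# ctl00%24ContentPlaceHolder1%24C001%24tbUserAddress=100&ctl00%24ContentPlaceHolder1%24C001%24ddlRadius=25
```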
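So let's put together some data to send in a POST request and get the desired response.
I use python requests to send the request
and lxml to parse the HTML response.
I prefer lxml: it is harder to understand than BeautifulSoup, but faster.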
import requests
from lxml import html

# Form field names as observed in the POST request.
input_data = {
    'ctl00$cphTopBar$T917BC451013$rblRadius': 25,
    'ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius': 25,
    'ctl00$ContentPlaceHolder1$C001$ddlRadius': 25,
    'ctl00$ContentPlaceHolder1$C001$tbUserAddress': 100
}
resp = requests.post('http://www.sleepeducation.org/find-a-facility', data=input_data)
tree = html.fromstring(resp.text)
# Print the map canvas div from the response.
print(tree.xpath('//div[@id="ContentPlaceHolder1_C001_map_canvas"]')[0])
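One thing the snippet above does not send is any hidden field. Pages built with ASP.NET WebForms (the WebForm_DoPostBackWithOptions call in the question's HTML suggests this is one) usually carry hidden inputs such as __VIEWSTATE, and some servers reject a POST without them. Whether this particular site needs them is an assumption I have not checked. A minimal sketch of collecting them with the standard library follows; HiddenInputCollector, build_payload and sample_page are hypothetical names and data, not part of the site or of any library.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_payload(page_html, user_fields):
    """Merge the page's hidden inputs with the fields we want to set."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload.update(user_fields)  # our values win on a name clash
    return payload


# Hypothetical fragment standing in for the HTML of a real GET response.
sample_page = """
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
<input type="text" name="ctl00$ContentPlaceHolder1$C001$tbUserAddress" />
"""

payload = build_payload(sample_page, {
    "ctl00$ContentPlaceHolder1$C001$tbUserAddress": "33060",
    "ctl00$ContentPlaceHolder1$C001$ddlRadius": "100",
})
print(payload)
```

With requests, the idea would be to GET the page first (ideally through a requests.Session so cookies carry over), pass resp.text to build_payload, and then POST the resulting dict as data.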
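I don't have enough reputation to post links to the documentation. I'll try to put them in the comments, or you can google "python requests" and "python lxml".
You can also use BeautifulSoup: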
# bs4 is the current BeautifulSoup package; the old "import BeautifulSoup" (version 3) only works on Python 2.
from bs4 import BeautifulSoup
import requests

input_data = {
    'ctl00$cphTopBar$T917BC451013$rblRadius': 25,
    'ctl00$ContentPlaceHolder1$T6B6681F0008$ddlRadius': 25,
    'ctl00$ContentPlaceHolder1$C001$ddlRadius': 25,
    'ctl00$ContentPlaceHolder1$C001$tbUserAddress': 100
}
resp = requests.post('http://www.sleepeducation.org/find-a-facility', data=input_data)
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.find("div", {"id": "ContentPlaceHolder1_C001_map_canvas"}))
This worked for me:
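from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.sleepeducation.org/find-a-facility'
CODE = '33060'  # zip code to search for, as in the question
subButton = 'ContentPlaceHolder1_C001_btnSearchClient'
addyName = 'ctl00$ContentPlaceHolder1$C001$tbUserAddress'
addyId = 'ContentPlaceHolder1_C001_tbUserAddress'

def usingChromeSelenium():
    # Raw string so the backslashes in the Windows path are not read as escapes.
    driver = webdriver.Chrome(r'C:\Users\documents\chromedriver.exe')
    driver.get(url)
    sleep(1)  # give the page time to load
    # find_element_by_* is the Selenium 3 API; Selenium 4 uses find_element(By.NAME, ...).
    driver.find_element_by_name(addyName).send_keys(CODE)
    driver.find_element_by_id(subButton).click()
    sleep(1)  # give the results time to render
    html = driver.page_source
    return html

results = usingChromeSelenium()
soup = BeautifulSoup(results, "html.parser")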
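For "webdriver.Chrome()", you have to download the chromedriver.exe application file and include its file path inside the parentheses. It may also work for you without a path.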
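A note on the fixed sleep(1) calls in the Selenium code above: one second is a guess, so on a slow connection the page source may be read before the results are there. Selenium ships WebDriverWait for this purpose; the sketch below uses only the standard library to show the same polling idea. wait_until, page_ready and checks are hypothetical names made up for the example, and page_ready is a stand-in for a real check against the browser.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for a page that only becomes ready on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_ready, timeout=2.0, interval=0.01)
print(checks["count"])  # 3
```

In the Selenium function, the predicate would be something that looks up the results element in the driver and returns it once it exists, so the code continues as soon as the page is ready instead of always waiting a full second.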