Scraping data from href
I'm trying to get the postcodes of DFS stores. My plan is to grab each store's href, click through to it, and read the postcode from the store page that opens, which shows the store's location. I can't get it to work, though — where am I going wrong?
I first select the parent elements with td.searchResults, then try to click the link whose title contains DFS and read the postalCode from the page it opens, eventually iterating over all three result pages.
If there is a better way to do this, please let me know.
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)
Update:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
            m_text = m.get_text()
            shops.append(m_text)
print(shops)
There are a few problems in your code. You are using an infinite loop with no break condition. Also, shops = {} creates a dict, but append is a list method, so calling it on shops fails.
You could use python-requests or urllib2 instead of selenium.
But staying with your approach, you can do it like this:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
shops = []
for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info:
        info_text = info.get_text()
        shops.append(info_text)
print(shops)
In BeautifulSoup you can find a tag by its attributes like this:
soup.find('span', attrs={"itemprop": "postalCode"})
Also, if it doesn't find anything it returns None, and calling .get_text() on None raises AttributeError. So check the result before applying .get_text().
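This None-check pattern can be seen in isolation on a small inline snippet (the markup below is made up for illustration; the postcode value is not a real store's):

```python
from bs4 import BeautifulSoup

html = '<td class="searchResults"><span itemprop="postalCode">SW1A 1AA</span></td>'
soup = BeautifulSoup(html, 'html.parser')

# Tag present: find() returns the element, so get_text() is safe
info = soup.find('span', attrs={'itemprop': 'postalCode'})
postcode = info.get_text() if info else None
print(postcode)  # SW1A 1AA

# Tag absent: find() returns None, so guard before calling .get_text()
missing = soup.find('span', attrs={'itemprop': 'telephone'})
value = missing.get_text() if missing else None
print(value)  # None
```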
So after playing with this for a while, I think the best way to do it is not with selenium. That route would need driver.back(), waiting for elements to reappear, and a bunch of other things. I was able to get what you want using just requests, re and bs4. re is included in the Python standard library, and if you don't have requests yet you can install it with pip like this: pip install requests
from bs4 import BeautifulSoup
import re
import requests
base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
shops = []

links = soup.find_all('a', href=re.compile(r'/store/'))
for l in links:
    full_link = base_url + l['href']
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text, 'html.parser')
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))
print(shops)
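For the "iterate all three pages" part, you can stay with requests as well by following the listing's next-page link instead of clicking. The sketch below assumes the pagination control is an anchor whose text is 'Next' — that selector is a guess, so inspect the real markup and adjust it. The extraction is kept in a small helper so it can be demonstrated on inline HTML without hitting the site (the markup in the demo is hypothetical):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def next_page_url(soup, current_url):
    """Return the absolute URL of the next results page, or None.

    Assumes pagination is an <a> whose text is 'Next' -- adjust the
    selector to match the site's actual markup.
    """
    link = soup.find('a', string='Next')
    if link and link.get('href'):
        return urljoin(current_url, link['href'])
    return None

# Offline demonstration on a hypothetical pagination snippet:
page = '<div class="paging"><a href="/stores/75061/dfs/?page=2">Next</a></div>'
soup = BeautifulSoup(page, 'html.parser')
print(next_page_url(soup, 'http://www.localstore.co.uk/stores/75061/dfs/'))
```

In the scraper you would then loop: fetch a page, collect its shops, call next_page_url, and stop when it returns None.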