Scraping data from href

I am trying to get the postal codes of DFS stores. To do that, I try to get each store's href and click it; the next page has the store's location, from which I can grab the postal code. But I can't get it to work. Where am I going wrong? I tried to first get the parent element td.searchResults, then click the href with title DFS, and then grab the postalCode, eventually iterating over all three pages. If there is a better way, please let me know.

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)

Update:

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
           m_text = m.get_text()
           shops.append(m_text)
    print (shops)

Your code has a few problems. You are using an infinite while loop with no break condition. Also, shops = {} creates a dict, but you call the append method on it, which only lists have. You could also use python-requests or urllib2 instead of selenium.
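To illustrate the dict/list mix-up (the postcode string is just a placeholder):

```python
# A dict has no append method, so the original code raises an error:
shops = {}
try:
    shops.append('SW1A 1AA')
except AttributeError as e:
    print(e)              # 'dict' object has no attribute 'append'

# Use a list instead, which does support append:
shops = []
shops.append('SW1A 1AA')
print(shops)              # ['SW1A 1AA']
```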

But within your code, you can do it like this:

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info:
        info_text = info.get_text()
        shops.append(info_text)
    print(shops)

In BeautifulSoup, you can find a tag by its attributes like this:

soup.find('span', attrs={"itemprop": "postalCode"})

Similarly, if it doesn't find anything, it will return None, and calling the .get_text() method on None will raise an AttributeError. So check the result first, before applying .get_text().
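A minimal sketch of that check, using a made-up HTML snippet (the postcode and attribute values are placeholders):

```python
from bs4 import BeautifulSoup

html = '<div><span itemprop="postalCode">SW1A 1AA</span></div>'
soup = BeautifulSoup(html, 'html.parser')

info = soup.find('span', attrs={"itemprop": "postalCode"})
if info:                      # find() returned a tag
    print(info.get_text())    # SW1A 1AA

missing = soup.find('span', attrs={"itemprop": "telephone"})
print(missing)                # None -- find() matched nothing
# missing.get_text() would raise AttributeError here
```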

So after playing with this for a while, I think the best way is not to use selenium. It would require using driver.back(), waiting for elements to reappear, and a whole bunch of other things. I was able to get what you want using just requests, re and bs4. re is included in the Python standard library, and if you don't already have requests installed, you can install it with pip like this: pip install requests

from bs4 import BeautifulSoup
import re
import requests

base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'
res = requests.get(url)
soup = BeautifulSoup(res.text)

shops = []

links = soup.find_all('a', href=re.compile(r'.*\/store\/.*'))

for l in links:
    full_link = base_url + l['href']
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))

print(shops)
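For reference, the href=re.compile(...) filter above works because BeautifulSoup runs the compiled pattern against each tag's href value and keeps only the anchors that match. A minimal sketch with made-up links:

```python
from bs4 import BeautifulSoup
import re

html = '''
<a href="/store/1/dfs-croydon/" title="DFS, Croydon">Croydon</a>
<a href="/about/">About</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only anchors whose href matches the pattern are returned
links = soup.find_all('a', href=re.compile('/store/'))
print([l['href'] for l in links])    # ['/store/1/dfs-croydon/']
```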