I am trying to scrape a website for links and also scrape the links inside the already scraped links
I am trying to scrape a website for links, and after scraping them I also want to check whether each scraped link is an article or contains more links; if it does, I want to scrape those links as well. I am trying to do this with BeautifulSoup 4, and this is the code I have so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.lbbusinessjournal.com/'
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)
I want the links from https://www.lbbusinessjournal.com/ and also want to scrape for possible links inside the links I get from that page, for example https://www.lbbusinessjournal.com/news/: I want the links on the https://www.lbbusinessjournal.com/news/ page as well. So far I only get links from the home page.
Try raise e in your except clause and you will see that the error

AttributeError: 'NoneType' object has no attribute 'get'

originates from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the h3 elements you retrieve contains no a element in its HTML; in fact, the link appears to be commented out in the HTML.

Instead, you should split the post1.find('a').get('href') call into two steps and check that the element returned by post1.find('a') is not None before trying to get the 'href' attribute, i.e.:
for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
Output of running your code with this change:
https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
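To also get the links inside pages like https://www.lbbusinessjournal.com/news/ (one level deeper than your current loop goes), one option is to crawl level by level and remember which pages have already been visited so they are not fetched twice. Below is a minimal sketch, assuming the same h3/li selectors as your code and keeping only links on the same site; the helper names (extract_links, crawl), the depth limit and the User-Agent value are just placeholders for illustration:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0'  # placeholder; use whatever User-Agent string you already use

def extract_links(url):
    """Return the href values of the same h3/li elements your code targets."""
    try:
        r = requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
    except requests.RequestException:
        return []
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        a = post.find('a')
        if a is not None and a.get('href'):  # same None check as above
            links.append(a.get('href'))
    return links

def crawl(start_url, max_depth=2):
    """Breadth-first crawl: print and follow links found on each page, up to max_depth levels."""
    visited = set()
    frontier = [start_url]
    for depth in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited or not url.startswith('https://www.lbbusinessjournal.com'):
                continue  # skip pages already fetched and links to other sites
            visited.add(url)
            for link in extract_links(url):
                print(link)
                next_frontier.append(link)
        frontier = next_frontier

crawl('https://www.lbbusinessjournal.com/')

The visited set prevents re-fetching pages such as the home page, which shows up as a link on most of the other pages, and raising max_depth would follow the links another level down.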