I am trying to scrape a website for links and also scrape the links inside the already scraped links
I am trying to scrape a website for links, and after scraping them I also want to check whether each scraped link is an article or contains more links; if it does, I want to scrape those links as well. I am trying to do this with BeautifulSoup 4, and this is the code I have so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.lbbusinessjournal.com/'
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)
I want the links from https://www.lbbusinessjournal.com/ and also want to scrape for possible links inside the links I get from that page, for example https://www.lbbusinessjournal.com/news/: I want the links on the https://www.lbbusinessjournal.com/news/ page as well. So far I only get links from the home page.
Try raise e in your except clause and you will see that the error

AttributeError: 'NoneType' object has no attribute 'get'

originates from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the h3 elements you retrieve contains no a element in its HTML; in fact, the link appears to be commented out in the HTML.

Instead, you should split the post1.find('a').get('href') call into two steps and check that the element returned by post1.find('a') is not None before trying to get the 'href' attribute, i.e.:
for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
Output of running your code with this change:
https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
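To also get the links inside pages like https://www.lbbusinessjournal.com/news/ (one level deeper than your current loop goes), one option is to crawl level by level and remember which pages have already been visited so they are not fetched twice. Below is a minimal sketch, assuming the same h3/li selectors as your code and keeping only links on the same site; the helper names (extract_links, crawl), the depth limit and the User-Agent value are just placeholders for illustration:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0'  # placeholder; use whatever User-Agent string you already use

def extract_links(url):
    """Return the href values of the same h3/li elements your code targets."""
    try:
        r = requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
    except requests.RequestException:
        return []
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        a = post.find('a')
        if a is not None and a.get('href'):  # same None check as above
            links.append(a.get('href'))
    return links

def crawl(start_url, max_depth=2):
    """Breadth-first crawl: print and follow links found on each page, up to max_depth levels."""
    visited = set()
    frontier = [start_url]
    for depth in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited or not url.startswith('https://www.lbbusinessjournal.com'):
                continue  # skip pages already fetched and links to other sites
            visited.add(url)
            for link in extract_links(url):
                print(link)
                next_frontier.append(link)
        frontier = next_frontier

crawl('https://www.lbbusinessjournal.com/')

The visited set prevents re-fetching pages such as the home page, which shows up as a link on most of the other pages, and raising max_depth would follow the links another level down.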