How to get only links from parsed HTML using Python?
How can I get the links when the tags look like this?
<a href="/url?q=instagram.com/goinggourmet/… class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (@goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div></a>
I tried the code below. It does extract the URLs, but they come out in this format:
/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-
/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e
/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR
I only need the Facebook and Instagram URLs, without anything extra. In other words, I want the real links, not the redirect links.
I need links like these:
'https://www.facebook.com/bespokecatering.sydney'
'https://www.instagram.com/bespoke_catering'
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
Any help is greatly appreciated.
I also tried the code below, but it returns empty or different results:
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
for url in urls:
    try:
        j = url.split('=')[1]
        k = '/'.join(j.split('/')[0:4])
        #print(k)
    except:
        k = ''
You have already selected the <a> elements. Just loop over the selection and print each result via ['href']:
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])
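A likely reason the second attempt in the question comes back empty: `urls` is overwritten on every pass, so after the loops it holds a single string, and `for url in urls` then walks over that string one character at a time. Splitting a single character on '=' never yields a usable second element, so `k` always ends up as ''. The sketch below illustrates this with hypothetical sample strings standing in for the `link['href']` values, and collects the hrefs in a list instead.

```python
# Hypothetical stand-ins for the values that link['href'] yields in the loops.
sample_hrefs = [
    "/url?q=https://bespokecatering.sydney/&sa=U",
    "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U",
]

# After the loops, `urls` holds only the last href: one string, not a list.
urls = sample_hrefs[-1]

# Iterating over a string yields single characters, not URLs.
first_items = list(urls)[:4]
print(first_items)

# Collect every href in a list so that a later loop sees whole URLs.
collected = []
for href in sample_hrefs:
    collected.append(href)
print(len(collected))
```

With the hrefs gathered in a list, any later clean-up step receives complete strings rather than characters.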
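If you improve the question and add the requested information, we can answer in more detail.
Edit
Answering your additional question with a simple example (which you should have provided in your question):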
import requests
from bs4 import BeautifulSoup
result = '''
<div class="kCrYT">
<a href="/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"></a>
</div>
'''
soup = BeautifulSoup(result, 'lxml')
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(dict(x.split('=') for x in requests.utils.urlparse(link['href']).query.split('&'))['q'].split('%3F')[0])
Result:
https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
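The result above still includes the non-social site and the full Facebook video path, while the question asks only for the Facebook and Instagram profile links. A minimal sketch of one way to narrow it down, using only the standard library `urllib.parse`, follows. The names `WANTED_HOSTS` and `real_profile_url` are hypothetical, and treating the first path segment as the profile is an assumption that may not hold for every URL.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: only these sites (and their subdomains such as www.) are wanted.
WANTED_HOSTS = ("facebook.com", "instagram.com")


def real_profile_url(href):
    """Return scheme://host/first-path-segment of the redirect target, or None."""
    # parse_qs decodes percent-escapes, so %3F in the q value becomes '?'.
    params = parse_qs(urlsplit(href).query)
    target = params.get("q", [None])[0]
    if not target:
        return None

    parts = urlsplit(target)
    host = parts.netloc.lower()
    # Targets without a scheme have an empty netloc and are skipped here.
    if not any(host == h or host.endswith("." + h) for h in WANTED_HOSTS):
        return None

    # Assumption: the first non-empty path segment identifies the profile.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    return f"{parts.scheme}://{parts.netloc}/{segments[0]}"


hrefs = [
    "/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-",
    "/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e",
    "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR",
]

profiles = [url for url in map(real_profile_url, hrefs) if url]
for url in profiles:
    print(url)
```

In the loop from the answer, `real_profile_url(link['href'])` could take the place of the one-line `dict(...)` expression. Relying on `parse_qs` rather than splitting on '=' by hand also avoids errors when a parameter value itself contains an '=' character.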