How to extract only a specific kind of link from a webpage with beautifulsoup4

Question

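I'm trying to extract specific links from a page that is full of links. The links I need contain the word "apartment".

But whatever I try, I extract far more data than just the links I need.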

<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">

I'd be grateful if someone could help me with this. I'd also appreciate a pointer to a good resource where I can learn more about it.

Answer 1

You can use a regular expression with the re module.

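import re
from bs4 import BeautifulSoup

# Pagesource is assumed to hold the page's HTML as a string
soup = BeautifulSoup(Pagesource, 'html.parser')
# Keep only <a> tags whose href contains "apartment"
alltags = soup.find_all("a", attrs={"href": re.compile("apartment")})
for item in alltags:
    print(item['href'])  # grab the href value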
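To see the regex filter work without fetching anything, here is a minimal self-contained sketch. The sample HTML is made up for illustration (only the first link comes from the question), and the variable names are my own. Beautiful Soup matches a compiled pattern against the attribute value with search(), so "apartment" can appear anywhere in the href, and tags with no href at all are skipped.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page; only the first link is taken from the question.
html = """
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">Ad</a>
<a href="https://www.website.com/en/ad/house/efgh456">House</a>
<a href="https://www.website.com/en/contact">Contact</a>
<a name="no-href">Anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is applied with search(), so it matches anywhere inside the href.
apartment_links = [
    a["href"] for a in soup.find_all("a", href=re.compile("apartment"))
]
print(apartment_links)
```

If you need a stricter match (for example only ad pages), you could tighten the pattern, e.g. re.compile(r"/ad/apartment/"), instead of filtering afterwards.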

Or you can use a CSS selector.

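from bs4 import BeautifulSoup

# Pagesource is assumed to hold the page's HTML as a string
soup = BeautifulSoup(Pagesource, 'html.parser')
# [href*='apartment'] matches any href that contains "apartment"
alltags = soup.select("a[href*='apartment']")
for item in alltags:
    print(item['href'])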
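Listing pages often repeat the same link (image plus title, say) and may use relative URLs, which is one way you end up with more data than you expected. The sketch below is only an illustration under those assumptions: the sample HTML and base_url are hypothetical, and the de-duplication step is my addition, not something the selector does for you.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and sample HTML, used only for illustration.
base_url = "https://www.website.com/en/search"
html = """
<a href="/en/ad/apartment/abcd123"><img src="thumb.jpg"></a>
<a href="/en/ad/apartment/abcd123">Nice flat</a>
<a href="/en/ad/apartment/wxyz789">Another flat</a>
<a href="/en/ad/house/efgh456">A house</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The *= operator selects anchors whose href contains the substring.
matches = soup.select("a[href*='apartment']")

# Resolve relative hrefs against the base URL, then drop duplicates
# while keeping first-seen order (dict keys preserve insertion order).
unique_links = list(dict.fromkeys(urljoin(base_url, a["href"]) for a in matches))

for link in unique_links:
    print(link)
```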

You can find more details in the official Beautiful Soup documentation.

Edited:

You need to select the parent div first and then find the anchor tag inside it.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
# Only anchors that are direct children of a result-item div
for item in soup.select("div[data-type='resultgallery-resultitem'] > a[href*='apartment']"):
    print(item['href'])
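The reason scoping to the parent div helps can be shown on a small made-up page. In this sketch only the data-type value comes from the code above; the rest of the HTML is hypothetical. Note that the > combinator selects direct children only, so an anchor wrapped in another element inside the div is not matched, whereas a space (descendant combinator) would match it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the data-type value comes from the answer above.
html = """
<div data-type="resultgallery-resultitem">
  <a href="https://www.website.com/en/ad/apartment/abcd123">Result</a>
</div>
<div class="footer">
  <a href="https://www.website.com/en/search/apartment/for-sale">All apartments</a>
</div>
<div data-type="resultgallery-resultitem">
  <span><a href="https://www.website.com/en/ad/apartment/nested999">Nested</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
parent = "div[data-type='resultgallery-resultitem']"

# ">" keeps only anchors that are direct children of a result-item div.
direct_children = [a["href"] for a in soup.select(f"{parent} > a[href*='apartment']")]

# A space keeps anchors at any depth inside a result-item div.
any_descendant = [a["href"] for a in soup.select(f"{parent} a[href*='apartment']")]

print(direct_children)
print(any_descendant)
```

Either way the footer link is excluded even though its href contains "apartment", because it does not sit inside a result-item div. Which combinator you want depends on how the real page nests its anchors, so check the markup in your browser's inspector first.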

python

screen-scraping

beautifulsoup

web-scraping