How to get specific text hyperlinks in the home webpage by BeautifulSoup?

Question

I want to find all hyperlinks on https://www.geeksforgeeks.org/ whose link text contains "article". For example, at the bottom of that page there are:

Write an Article
Improve an Article

I want to get all of these hyperlinks and print them, so I tried:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re

url = 'https://www.geeksforgeeks.org/'

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a',href = True):
    #print(link.get("href")

    if re.search('/article$', href):
        links.append(link.get("href"))

But the result is []. How can I fix this?

Answer 1

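You can try the following. Note that the page you linked contains more links with the text "article" than the two you listed, but this shows the idea of how to handle it.

In this case I simply check whether the word article appears in the text of the tag. You could use a regex search there instead, but for this example that would be overkill.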

import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)

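# Stop early if the request did not succeed.
if res.status_code != 200:
    raise SystemExit(f"Request failed with status code {res.status_code}")

soup = BeautifulSoup(res.content, "html.parser")

# Keep every <a> tag whose visible text contains "article" (case-insensitive).
links_with_article = soup.find_all(lambda tag: tag.name == "a" and "article" in tag.text.lower())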
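
The regex search mentioned above could look roughly like this. It is only a sketch: the `html` string and its URLs are made up for illustration (they are not the real page markup), so the snippet runs without a network request.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the page footer; not the real markup.
html = """
<a href="/write">Write an Article</a>
<a href="/improve">Improve an Article</a>
<a href="/jobs">Careers</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Case-insensitive match on the visible link text rather than on the href.
pattern = re.compile(r"article", re.IGNORECASE)
regex_links = [
    a for a in soup.find_all("a", href=True)
    if pattern.search(a.get_text())
]

for a in regex_links:
    print(a.get_text(strip=True), a["href"])
```

The point to notice is that the pattern is applied to the link text. Matching a pattern such as '/article$' against the href only finds links whose URL ends in exactly that string.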

Edit:

If you know the word appears in the href, i.e. in the link itself:

soup.select("a[href*=article]")

This selects every a element whose href contains the substring article.

Edit: to get only the hrefs:
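
As a small illustration of how the attribute selector behaves, here is a sketch on made-up markup (the `html` string and its paths are assumptions, not taken from the real site). It matches on the href only, so a link whose text mentions "article" but whose URL does not is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<a href="/write-an-article/">Write</a>
<a href="/articles-on-python/">Python</a>
<a href="/about/">About an article</a>
<a>article without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# [href*="article"] matches any <a> whose href contains the substring "article".
href_matches = soup.select('a[href*="article"]')
matched_hrefs = [a["href"] for a in href_matches]
print(matched_hrefs)
```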

hrefs = [link.get('href') for link in links_with_article]
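
One thing to watch for, shown here as a sketch on invented markup (the `html` string is an assumption): the text-based filter also returns a tags that have no href attribute, and `link.get('href')` yields `None` for those. Filtering them out afterwards is one possible way to handle this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second anchor has no href attribute.
html = """
<a href="/write">Write an Article</a>
<a name="top">Article index</a>
<a href="/improve">Improve an Article</a>
"""
soup = BeautifulSoup(html, "html.parser")

links_with_article = soup.find_all(
    lambda tag: tag.name == "a" and "article" in tag.text.lower()
)

# link.get("href") returns None for the anchor without an href.
hrefs = [link.get("href") for link in links_with_article]

# Drop the None entries so only real link targets remain.
hrefs_clean = [h for h in hrefs if h is not None]
print(hrefs)
print(hrefs_clean)
```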

python

beautifulsoup

python-3.x

python-requests