Using Beautiful Soup to extract specific groups of links from a blogspot website

Question

I want to extract every Year 7 link from my school's website. In the archive they are easy to find with Ctrl+F "year-7". With BeautifulSoup it is not so easy, or I am doing something wrong.

import requests
from bs4 import BeautifulSoup

URL = '~school URL~'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

This gives me every link in the website's archive. Every link that matters to me looks roughly like this:

~school URL~blogspot.com/2020/10/mathematics-activity-year-x.html

I tried storing link.get('href') in a variable and searching it for "year-x", but that did not work.

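Any ideas on how to search for these links? Blogspot's own search is terrible. I am doing this to help kids in a poor area find their classes more easily: the material was simply left on the site for the next school year, and there are hundreds of untagged links for different school years. I am trying to at least compile a list of links for each school year to help them.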

Answer 1

As I understand it, you want to extract the year from each link. Try using a regex to extract it.

In your case:

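import re
from bs4 import BeautifulSoup

txt = """<a href="blogspot.com/2020/10/mathematics-activity-year-x.html"></a>"""
soup = BeautifulSoup(txt, "html.parser")

years = []

for tag in soup.find_all("a", href=True):  # skip <a> tags that have no href
    link = tag["href"]
    # \w+ captures the whole year token, including multi-digit years like year-10
    match = re.search(r"year-\w+", link)
    if match:  # re.search returns None when the link contains no year
        years.append(match.group())

print(years)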

Output:

['year-x']
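
To build one list of links per school year, as the question asks, the same regex idea can be extended to group hrefs by the year found in each one. This is only a sketch: the group_links_by_year helper and the example.blogspot.com URLs are made up for illustration, and it assumes every relevant link contains "year-" followed by a number.

```python
import re
from collections import defaultdict

# Assumed pattern: "year-" followed by digits, e.g. "year-7" or "year-10".
YEAR_PATTERN = re.compile(r"year-(\d+)")


def group_links_by_year(hrefs):
    """Return a dict mapping a year number to the list of links for that year."""
    groups = defaultdict(list)
    for href in hrefs:
        if not href:  # link.get('href') returns None for <a> tags without an href
            continue
        match = YEAR_PATTERN.search(href)
        if match:
            groups[int(match.group(1))].append(href)
    return dict(groups)


# Hypothetical sample data standing in for the hrefs collected from the page.
sample_hrefs = [
    "https://example.blogspot.com/2020/10/mathematics-activity-year-7.html",
    "https://example.blogspot.com/2020/10/science-activity-year-10.html",
    "https://example.blogspot.com/2020/11/english-activity-year-7.html",
    None,
    "https://example.blogspot.com/p/about.html",
]

for year, links in sorted(group_links_by_year(sample_hrefs).items()):
    print(year, links)
```

With the real page, the list passed in would be the link.get('href') values from the loop in the question. Skipping None values matters because searching a missing href with a regex raises an error.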

Edit: Try using a CSS selector to select all links whose href ends with year-7.html:

...
for tag in soup.select('a[href$="year-7.html"]'):
    print(tag)
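
A self-contained sketch of the selector approach, using a small inline HTML snippet in place of the real page (the snippet and its example.blogspot.com URLs are hypothetical). Note that $= only matches hrefs that end with exactly the given text, so a link with a query string appended would be missed; *= would instead match the text anywhere in the href.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the archive page.
html = """
<a href="https://example.blogspot.com/2020/10/mathematics-activity-year-7.html">Maths</a>
<a href="https://example.blogspot.com/2020/10/mathematics-activity-year-8.html">Maths</a>
<a href="https://example.blogspot.com/2020/11/science-activity-year-7.html">Science</a>
<a name="no-href">anchor without an href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href$="..."] selects <a> tags whose href attribute ends with the given string.
year_7_links = [tag["href"] for tag in soup.select('a[href$="year-7.html"]')]

for href in year_7_links:
    print(href)
```

Anchors without an href are never selected, so no None check is needed here.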

python

automation

beautifulsoup