Using `find_all()` to get all tags of a subset of that same tag

Question

I'm trying to find all <a> HTML tags of a particular type in an HTML document.

My code:

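import requests
from bs4 import BeautifulSoup

for url in top_url_list:  # top_url_list: a list of page URLs, defined elsewhere
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")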

At this point I need to extract part of the link from the href attribute (using some regex).

The tags look like this:

"<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>"

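There are other <a href...> tags that don't follow this convention, and I don't want find_all() to return them.

What can I pass to find_all() to retrieve only the set of <a> tags whose href I need to process?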

Answer 1

There are other links on the page that don't follow that convention because they aren't links to player pages, they might be links to team pages and whatnot.

I would then check whether the href starts with /players:

for link in soup.select('a[href^="/players"]'):
    print(link["href"])

Or, whether it contains players:

for link in soup.select('a[href*=players]'):
    print(link["href"])
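Since the question asks about find_all() specifically, the same filter can also be written by passing a compiled regular expression as the href argument. This is only a minimal sketch: the sample markup, the link text, and the /teams/ link are made up for illustration and stand in for the downloaded page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>
<a href="/teams/BOS/">Some team page</a>
<a href="/players/a/abdulza01.html">Another player</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a compiled regex as an attribute filter and keeps only
# the tags whose href matches it. Matching uses search(), hence the ^ anchor.
player_links = soup.find_all("a", href=re.compile(r"^/players/"))
player_hrefs = [link["href"] for link in player_links]
print(player_hrefs)
```

Tags without an href attribute are skipped by this filter, so indexing link["href"] on the results is safe.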

Since you are only interested in the HTML file name, split on / and take the last item:

print(link["href"].split("/")[-1])
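If some hrefs carry a query string or fragment, a plain split would leave that suffix attached to the file name. The question does not show such links, so treat this as an optional, slightly more defensive sketch; html_file_name is a hypothetical helper name, and the second href is an invented example.

```python
import posixpath
from urllib.parse import urlsplit


def html_file_name(href):
    """Return the last path segment of an href, ignoring any query or fragment."""
    # urlsplit separates the path from ?query and #fragment parts.
    return posixpath.basename(urlsplit(href).path)


print(html_file_name("/players/a/abdelal01.html"))
# Invented example of an href with extra parts after the path.
print(html_file_name("/players/a/abdelal01.html?x=1#top"))
```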

Answer 2

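Since all you need is a part of the href value itself, there is no need to use Beautiful Soup or an HTML parser for this. The task can be done with just the page source and a regular expression, as shown below.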

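The regular expression matches a string like abdelal01.html: a run of letters, two digits, a period, and another run of letters. The expression itself is passed as the first argument to findall, and the second argument is the page source. The page source is obtained by calling urlopen(), then read() to get the HTML, and then converting the result to a string so the regex can work on it.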

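The result is shown below. It outputs a list of file names taken from the href values, which you can iterate over and append to the original URL. Hope this helps!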

from urllib.request import urlopen
import re

url = "http://www.basketball-reference.com/players/a/"
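# Raw string avoids invalid-escape warnings; \. matches a literal period.
result = re.findall(r"/([a-z]+[0-9][0-9]\.[a-z]+)", urlopen(url).read().decode("utf-8", errors="replace"))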
print(result)

Output:

['abdelal01.html', 'abdulza01.html', 'abdulka01.html', 'abdulma02.html', 'abdulta01.html', 'abdursh01.html', 'abernto01.html',
'ablefo01.html', 'abramjo01.html', 'ackeral01.html', 'ackerdo01.html', 'acresma01.html', 'actonbu01.html', 'acyqu01.html',
'adamsal01.html', 'adamsdo01.html', 'adamsge01.html', 'adamsha01.html', 'adamsjo01.html'...]
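To try the pattern without hitting the network, and to show one way of appending the results to the original URL, here is a minimal sketch. The page_source string is a made-up fragment standing in for the real page, and the pattern matches the period with a literal `\.`; urljoin from the standard library does the joining.

```python
import re
from urllib.parse import urljoin

base_url = "http://www.basketball-reference.com/players/a/"

# Hypothetical fragment of page source, standing in for urlopen(url).read().
page_source = (
    '<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a>'
    '<a href="/teams/BOS/">Team</a>'
    '<a href="/players/a/abdulza01.html">Another player</a>'
)

# A slash, then (captured) letters, two digits, a literal period, and letters.
file_names = re.findall(r"/([a-z]+[0-9]{2}\.[a-z]+)", page_source)

# urljoin resolves each file name relative to the directory of base_url.
full_urls = [urljoin(base_url, name) for name in file_names]

print(file_names)
print(full_urls)
```

Note that the pattern only looks at the shape of the file name, so any other link on the page with the same shape would be picked up too.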

python

screen-scraping

beautifulsoup