Getting href using Beautiful Soup

Question

I am trying to extract a specific link from this HTML:

<a class="pageNum taLnk" data-offset="10" data-page-number="1" 
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2" 
href="www.blahblahblah.com/bb45135">Page 2 </a>

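As you can see, the links (href) are arbitrary, so there is no pattern I can rely on. That means I need to extract the href manually with BeautifulSoup.

I specifically want the href for page 2.

This is my current code: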

 from bs4 import BeautifulSoup
 import urllib.request  # a bare "import urllib" does not load the request submodule

 url = 'https://www.tripadvisor.com/ShowUserReviews-g293917-d539542-r447460956-Duangtawan_Hotel_Chiang_Mai-Chiang_Mai.html#REVIEWS'
 page = urllib.request.urlopen(url)
 soup = BeautifulSoup(page, 'html.parser')
 # Print every pagination link on the page
 for link in soup.find_all('a', attrs={'class': 'pageNum taLnk'}):
     print(link)

As you can see, I have been trying to get the href for page 2 specifically. Is there any way to select the link using the extra information inside the tag, such as data-page-number="2"?

Answer 1

page_2 = soup.find('a', attrs = {'data-page-number' : '2'})

This only gives you page 2. If you want the next page regardless of which page you are currently on, look for the "next page" link instead:

next_page = soup.find('a', attrs = {'class' : 'nav next rndBtn ui_button primary taLnk'})
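
Both calls return a Tag (or None if nothing matches), not the URL itself, so the href still has to be read from the tag. A minimal sketch, assuming the two-link snippet from the question is the whole document (the variable names html and page_2_href are only for illustration):

```python
from bs4 import BeautifulSoup

# The snippet from the question, used here in place of the live page
html = '''
<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag, or None when there is no match
page_2 = soup.find('a', attrs={'data-page-number': '2'})

# Tag.get() returns None instead of raising if the attribute is missing
page_2_href = page_2.get('href') if page_2 is not None else None
print(page_2_href)  # www.blahblahblah.com/bb45135
```

The None check matters on a real page, where the pagination markup may be missing or may have changed.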

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
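
As an alternative to the attrs dictionary, a CSS attribute selector can express the same lookup. The sketch below again assumes the snippet from the question as input, and hrefs_by_page is a hypothetical name introduced only for this example:

```python
from bs4 import BeautifulSoup

# The snippet from the question, used here in place of the live page
html = '''
<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>
'''

soup = BeautifulSoup(html, 'html.parser')

# select_one() takes a CSS selector and returns the first match or None
page_2 = soup.select_one('a[data-page-number="2"]')
print(page_2['href'])

# Passing True as the value matches any tag that has the attribute at all,
# which makes it easy to build a page-number -> href lookup table
hrefs_by_page = {
    a['data-page-number']: a['href']
    for a in soup.find_all('a', attrs={'data-page-number': True})
}
print(hrefs_by_page)
```

The dictionary form is handy when several pages are needed, since each href can then be looked up by its page number as a string, for example hrefs_by_page['2'].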
