Retrieve the first HTML tag whose content matches a search

Question

Using: bs4, Python 3.9, lxml

Suppose I have some HTML like this:

<div>
    <a href="google.com">Item 3</a>
    <a href="facebook.com">Item 3</a>
</div>

I want to find the first occurrence of the text Item 3 and get the specific <a> tag and the link it points to. How can I do this? Thanks!

Answer 1

The .find() method returns the first instance it finds. So just look for an <a> tag with the given text and extract its href attribute:

from bs4 import BeautifulSoup


html = '''<div>
    <a href="google.com">Item 3</a>
    <a href="facebook.com">Item 3</a>
</div>'''


soup = BeautifulSoup(html, 'html.parser')
# string= is the current name for the older text= argument (bs4 4.4.0+)
item3 = soup.find('a', string='Item 3')['href']

Output:

print(item3)
google.com
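If no <a> tag has the given text, .find() returns None and indexing it with ['href'] raises a TypeError. A minimal sketch of a guarded version follows; the helper name first_link_with_text is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup

html = '''<div>
    <a href="google.com">Item 3</a>
    <a href="facebook.com">Item 3</a>
</div>'''


def first_link_with_text(markup, text):
    """Return (tag, href) for the first <a> whose string equals text, else None.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.find('a', string=text)
    if tag is None:
        # Nothing matched, so there is no href to read
        return None
    # .get() returns None instead of raising if the tag has no href
    return tag, tag.get('href')


found = first_link_with_text(html, 'Item 3')
missing = first_link_with_text(html, 'Item 4')
print(found)
print(missing)
```

The first call yields the google.com link, and the second yields None instead of raising.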

Answer 2

You can use .find with the text= argument and a lambda:

from bs4 import BeautifulSoup

html_doc = """
<div>
    <a href="google.com">Item 3</a>
    <a href="facebook.com">Item 3</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
to_search = "Item 3"

tag = soup.find(text=lambda t: to_search in t).parent
print(tag)

Prints:

<a href="google.com">Item 3</a>

Or, using a CSS selector:

a = soup.select_one('a:-soup-contains("Item 3")')
print(a)
print(a["href"])

Prints:

<a href="google.com">Item 3</a>
google.com
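Both approaches above match on substrings, so "Item 3" would also match a link labelled "Item 30". If an exact match that still tolerates surrounding whitespace is wanted, one option is to pass a compiled regular expression as string=. This is a sketch under that assumption; the sample markup below is altered from the question to show the difference:

```python
import re

from bs4 import BeautifulSoup

# Sample markup modified for illustration: padded text and a near-miss label
html_doc = """
<div>
    <a href="google.com"> Item 3 </a>
    <a href="facebook.com">Item 30</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
to_search = "Item 3"

# Anchor the pattern so only the whole label (ignoring outer whitespace) matches
pattern = re.compile(r"^\s*" + re.escape(to_search) + r"\s*$")

tag = soup.find("a", string=pattern)
href = tag["href"] if tag is not None else None
print(href)
```

As far as I know, the :-soup-contains pseudo-class needs a reasonably recent soupsieve (2.1 or later); on older versions the regex approach may be the easier route.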

Answer 3

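You can use XPath. Note that xpath() returns a list, so take its first element:

from lxml import etree

# html_doc: the HTML string from the question (must be well-formed XML for etree.fromstring)
root = etree.fromstring(html_doc)
# xpath() returns a list of matching elements; index 0 is the first match
e = root.xpath('.//a[text()="Item 3"]')[0]

Output:

print(e.text)
Item 3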
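Real-world HTML is often not well-formed XML, in which case etree.fromstring may raise a parse error. A sketch using the more lenient lxml.html parser, which also reads the href and handles the no-match case (the variable names are illustrative):

```python
from lxml import html as lxml_html

html_doc = """
<div>
    <a href="google.com">Item 3</a>
    <a href="facebook.com">Item 3</a>
</div>
"""

# lxml.html tolerates markup that is not strict XML
root = lxml_html.fromstring(html_doc)

# normalize-space() trims and collapses whitespace in the element's text
matches = root.xpath('.//a[normalize-space()="Item 3"]')

# xpath() returns a list in document order; it is empty when nothing matches
first = matches[0] if matches else None
href = first.get("href") if first is not None else None
print(href)
```

Here matches holds both <a> elements, and taking index 0 gives the first one in document order.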

Tags: python, lxml, beautifulsoup, web-scraping