Extracting 'a' tags containing a specific substring with Python's BeautifulSoup

Question

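Using BeautifulSoup, I want to return only the "a" tags whose href string contains "Company" and not "Sector". Is there a way to use a regular expression with re.compile() to return only the companies and not the sectors?

Code: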

soup = soup.findAll('tr')[5].findAll('a')
print(soup)

Output:

[<a class="example" href="../ref/index.htm">Example</a>,  
<a href="?Company=FB">Facebook</a>,  
<a href="?Company=XOM">Exxon</a>,  
<a href="?Sector=5">Technology</a>,  
<a href="?Sector=3">Oil & Gas</a>]

Using this approach:

import re
soup.findAll('a', re.compile("Company"))

Returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

But I want it to return (without the Sector links):

[<a href="?Company=FB">Facebook</a>,
<a href="?Company=XOM">Exxon</a>]

Using:

Urllib.request version: 3.5
BeautifulSoup version: 4.4.1
Pandas version: 0.17.1
Python 3

Answer 1

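Running soup = soup.findAll('tr')[5].findAll('a') and then soup.findAll('a', re.compile("Company")) overwrites the original soup variable. findAll returns a ResultSet, which is essentially a list of BeautifulSoup Tag objects, so it has no findAll method of its own; that is what raises the AttributeError. Try the following to get all the "Company" links, passing the pattern as the href keyword argument.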

import re
links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))

To get the text contained in these tags:

companies = [link.text for link in links]
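For anyone who wants to try this without the original page, here is a minimal sketch. The markup is a hypothetical stand-in built from the output shown in the question, with a single row instead of the real table, so the row index is 0 rather than 5. find_all is the current bs4 spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has more rows than this.
html = """
<table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the parsed document in `soup` and store results under new names,
# so the original object is never overwritten by a ResultSet.
row = soup.find_all("tr")[0]
links = row.find_all("a", href=re.compile("Company"))

# Pull the visible text out of each matching tag.
companies = [link.text for link in links]
print(companies)
```

The regular expression is searched against the href value, so it matches "Company" anywhere in the attribute, not only at the start.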

Answer 2

You can use a CSS selector to get all the a tags whose href starts with ?Company:

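from bs4 import BeautifulSoup

# html is the page source as a string
soup = BeautifulSoup(html, "html.parser")

# Quote the attribute value: newer bs4 releases reject an unquoted "?Company"
a = soup.select('a[href^="?Company"]')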

If you only need the ones in the sixth tr, you can use nth-of-type:

a = soup.select('tr:nth-of-type(6) a[href^="?Company"]')
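A small sketch of both selectors side by side, assuming a recent bs4 release where select supports these selectors. The six-row table is made up for illustration; only the second and sixth rows contain links.

```python
from bs4 import BeautifulSoup

# Hypothetical six-row table: a Company link in row 2,
# and a Company link plus a Sector link in row 6.
html = (
    "<table>"
    "<tr><td>row 1</td></tr>"
    '<tr><td><a href="?Company=XOM">Exxon</a></td></tr>'
    "<tr><td>row 3</td></tr>"
    "<tr><td>row 4</td></tr>"
    "<tr><td>row 5</td></tr>"
    '<tr><td><a href="?Company=FB">Facebook</a>'
    '<a href="?Sector=5">Technology</a></td></tr>'
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")

# Every a tag whose href starts with "?Company", anywhere in the document.
all_company = soup.select('a[href^="?Company"]')

# Only those inside a tr that is the sixth tr among its siblings.
sixth_row = soup.select('tr:nth-of-type(6) a[href^="?Company"]')

print([a.text for a in all_company])
print([a.text for a in sixth_row])
```

Note that nth-of-type counts tr elements among siblings under the same parent, which is not always the same row as findAll('tr')[5] when the page has nested or multiple tables.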

Answer 3

Thanks to @Padriac Cunningham and @Wyatt I for the answers above!! Here is a less elegant solution I came up with:

import re
for i in range(1, len(soup)):
    if re.search("Company" , str(soup[i])):
        print(soup[i])
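One thing to watch with the loop above: it searches the whole tag as a string, so a link whose text (rather than its href) mentions "Company" would also match. A possible variation that checks only the href is sketched below; the helper name is_company_link and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last link has "Company" in its text only.
html = """
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Company list by sector</a>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a")


def is_company_link(tag):
    """Return True if the tag's href mentions Company and not Sector."""
    href = tag.get("href", "")
    return "Company" in href and "Sector" not in href


company_links = [tag for tag in anchors if is_company_link(tag)]
for tag in company_links:
    print(tag)
```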

Answer 4

Another approach is XPath, which supports AND/NOT operations for querying by attributes in an XML document. Unfortunately, BeautifulSoup does not handle XPath itself, but lxml does:

from lxml.html import fromstring
import requests

r = requests.get("YourUrl")
tree = fromstring(r.text)
#get elements with company in the URL but excludes ones with Sector
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
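To try the same XPath expression without a network request, the sketch below parses an inline string instead of a fetched page. The markup is hypothetical; the third link contains both "Company" and "Sector" in its href to show what the not(...) clause excludes.

```python
from lxml.html import fromstring

# Hypothetical markup standing in for r.text.
html = """
<div>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=FB&amp;Sector=5">Facebook in Technology</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
</div>
"""

tree = fromstring(html)

# Keep links whose href contains "?Company" and does not contain "Sector".
a_tags = tree.xpath(
    "//a[contains(@href, '?Company') and not(contains(@href, 'Sector'))]"
)

hrefs = [a.get("href") for a in a_tags]
names = [a.text_content() for a in a_tags]
print(hrefs)
print(names)
```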
