How do I get the context of a search in BeautifulSoup?

Question

I am parsing web pages made up of various HTML elements, including the following fragment:

<p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
<p style="text-align: center;"><strong>My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a> </strong></p>
<p style="text-align: center;"><strong>some other words : <a href="http://example.com/anotherlink.html" target="_blank"> some other words</a></strong></p>

I am interested in the URL that follows My keywords (http://example.com/hello.html in the example above). The combination of My keywords and the link that follows it is unique within the page.

Right now I extract the URL with a regular expression:

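import requests
import re

def getfile(link):
    # Download the page and search its raw HTML.
    r = requests.get(link).text

    try:
        # Non-greedy capture so the match stops at the first closing quote.
        url = re.search(r'My keywords : <a href="(.+?)" target', r).group(1)
    except AttributeError:
        # re.search() returned None: the pattern was not found.
        print("no direct link for {link}".format(link=link))
    else:
        return url

print(getfile('http://example.com'))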

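This approach depends heavily on the exact format of the matched string. I would much rather use BeautifulSoup to:

search for My keywords
get its context (by which I mean the whole content of the tag that contains the string, in the example above My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a>)
run that through BeautifulSoup again to extract the href of the <a>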

I am stuck on the second step: all I get is

[u'My keywords : ']

when I try a string search:

import bs4
import re

thehtml = '''
    <p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
    <p style="text-align: center;"><strong>My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a> </strong></p>
    <p style="text-align: center;"><strong>some other words : <a href="http://example.com/anotherlink.html" target="_blank"> some other words</a></strong></p>
    '''
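# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = bs4.BeautifulSoup(thehtml, "html.parser")
# This matches text nodes only, so the result is a list of strings, not tags.
k = soup.find_all(text=re.compile("My keywords"))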
print(k)

How can I get the full content of the surrounding tag? (I cannot assume it will always be <strong>, as in the example above.)

Answer 1

Since there is only one match, you can use find() instead of find_all(). Then use next_sibling to reach the <a> tag and index it with 'href' to get the URL. For example:

import bs4
import re

thehtml = '''
    <p style="text-align: center;"><img src="http://example.com/smthg.png" alt="thealtttext" /></p>
    <p style="text-align: center;"><strong>My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a> </strong></p>
    <p style="text-align: center;"><strong>some other words : <a href="http://example.com/anotherlink.html" target="_blank"> some other words</a></strong></p>
    '''
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = bs4.BeautifulSoup(thehtml, "html.parser")
# find() returns the matching text node; its next sibling is the <a> tag.
k = soup.find(text=re.compile("My keywords")).next_sibling['href']
print(k)

Output:

http://example.com/hello.html
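
Note that next_sibling only works when the <a> comes immediately after the matched text. As a hedged sketch of a slightly more tolerant variant, the snippet below uses find_next() to walk forward to the first <a> that has an href. The helper name link_after and the extra <em> element in the sample HTML are assumptions made for illustration; they are not part of the original page.

```python
import re
import bs4

# Illustrative fragment (assumed): an <em> sits between the keyword text and
# the link, so next_sibling would return <em> rather than <a>.
html = (
    '<p><strong>My keywords : <em>new</em> '
    '<a href="http://example.com/hello.html" target="_blank"> some text </a>'
    '</strong></p>'
)
soup = bs4.BeautifulSoup(html, "html.parser")


def link_after(soup, pattern):
    """Return the href of the first <a> following the text that matches pattern."""
    # string= is the current name of the older text= argument.
    text_node = soup.find(string=re.compile(pattern))
    if text_node is None:
        return None  # keyword not present in the document
    # find_next() scans forward in document order from the text node.
    anchor = text_node.find_next("a", href=True)
    return anchor["href"] if anchor is not None else None


print(link_after(soup, "My keywords"))
```

One caveat: find_next() keeps scanning to the end of the document, so if the keyword has no link of its own, this returns the next link on the page, which may be unrelated.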

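UPDATE: Based on the comments, to get the element that contains the text, use parent:

k = soup.find(text=re.compile("My keywords")).parent

Printing k gives the whole enclosing tag (k.text would return only its text, without the markup):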

<strong>My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a> </strong>
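
Because parent is an ordinary Tag, it can be searched again without re-parsing, which covers the "extract the <a>" step from the question. The following is a minimal sketch under that assumption; the variable names and the shortened sample HTML (with <em> as a second wrapper) are illustrative only.

```python
import re
import bs4

# Shortened sample (assumed): the second paragraph uses <em> to show that the
# lookup does not depend on the wrapper being <strong>.
html = '''
<p><strong>My keywords : <a href="http://example.com/hello.html" target="_blank"> some text </a> </strong></p>
<p><em>some other words : <a href="http://example.com/anotherlink.html"> some other words</a></em></p>
'''
soup = bs4.BeautifulSoup(html, "html.parser")

# Step 1: find the text node that contains the keyword.
text_node = soup.find(string=re.compile("My keywords"))

# Step 2: its parent is the enclosing Tag, whatever its name happens to be.
container = text_node.parent
print(container.name)        # name of the wrapper tag
print(container.get_text())  # text only, markup stripped

# Step 3: search inside the container for the link.
link = container.find("a", href=True)
href = link["href"] if link is not None else None
print(href)
```

Restricting the search to the container avoids picking up links from other paragraphs, at the cost of assuming the link lives inside the same tag as the keyword.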
