BeautifulSoup

Question

I am scraping an HTML page with BeautifulSoup and want to select a string based on a dictionary key rather than an element tag.

In this case, I want to use "fmt_headline" as the key to get "Founder and CEO at SolarThermoChemical LLC".

<div id="srp_main_" class="">
<code id="voltron_srp_main-content" style="display:none;">

"fmt_headline":"Founder and CEO at SolarThermoChemical LLC",
"isConnectedEnabled":true,
"sharedConnectionToken":"240506fce660"

</div>

Any ideas on how to do this?

Answer 1

After parsing the HTML, BeautifulSoup can give you all of the text:

2>>> x
'<div id="srp_main_" class="">\n<code id="voltron_srp_main-content" style="display:none;">\n\n"fmt_headline":"Founder and CEO at SolarThermoChemical LLC",\n"isConnectedEnabled":true,\n"sharedConnectionToken":"240506fce660"\n\n</div>'
2>>> import bs4
2>>> soup = bs4.BeautifulSoup(x, 'html.parser')
2>>> y=soup.get_text()
2>>> y
u'\n\n\n"fmt_headline":"Founder and CEO at SolarThermoChemical LLC",\n"isConnectedEnabled":true,\n"sharedConnectionToken":"240506fce660"\n\n'

Further analysis of that text is then left to other tools, such as regular expressions:

2>>> import re
2>>> mo = re.search(r'"fmt_headline":"([^"]*)"', y)
2>>> print(mo.group(1))
Founder and CEO at SolarThermoChemical LLC
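
For reference, the two steps above can be combined into one script. This is only a sketch: the helper name value_for_key is made up for illustration, the html string is the sample from the question, and the regex only handles values that are quoted strings without embedded quotes.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = (
    '<div id="srp_main_" class="">\n'
    '<code id="voltron_srp_main-content" style="display:none;">\n\n'
    '"fmt_headline":"Founder and CEO at SolarThermoChemical LLC",\n'
    '"isConnectedEnabled":true,\n'
    '"sharedConnectionToken":"240506fce660"\n\n'
    '</div>'
)

def value_for_key(text, key):
    """Hypothetical helper: return the quoted string value that follows
    "key": in text, or None if the key is absent or its value is unquoted."""
    pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(key)
    match = re.search(pattern, text)
    return match.group(1) if match else None

soup = BeautifulSoup(html, "html.parser")
headline = value_for_key(soup.get_text(), "fmt_headline")
print(headline)
```

Unquoted values such as the true after "isConnectedEnabled" are not matched by this pattern, so the helper returns None for them.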

Answer 2

The data you want is inside an HTML comment, so you need to extract the comment first.

from bs4 import BeautifulSoup, Comment

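# soup is the BeautifulSoup object built from the page's HTML.
tag = soup.find('code', attrs={'id': "voltron_srp_main-content"})
# Collect the comment nodes inside the <code> tag.
tag_comments = tag.find_all(text=lambda text: isinstance(text, Comment))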

Now you can parse the contents of tag_comments as JSON (it looks like JSON) or use a regular expression as shown in Alex Martelli's answer.
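
A rough sketch of the JSON route, under some loud assumptions: the html string below is a made-up stand-in in which the data sits inside an HTML comment as a complete JSON object with braces (the snippet in the question shows neither the comment markers nor the braces), and the comment nodes are collected with a plain loop over tag.descendants instead of the find_all call above.

```python
import json
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the question's data wrapped in an HTML comment as a
# full JSON object. The real page may be laid out differently.
html = (
    '<div id="srp_main_" class="">'
    '<code id="voltron_srp_main-content" style="display:none;">'
    '<!--{"fmt_headline":"Founder and CEO at SolarThermoChemical LLC",'
    '"isConnectedEnabled":true,'
    '"sharedConnectionToken":"240506fce660"}-->'
    '</code></div>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("code", attrs={"id": "voltron_srp_main-content"})

# Keep only the comment nodes found under the <code> tag.
comments = [node for node in tag.descendants if isinstance(node, Comment)]

# A Comment is a string subclass holding the text between <!-- and -->.
data = json.loads(str(comments[0]))
print(data["fmt_headline"])
```

Once the comment is parsed, data is an ordinary dict, so any key can be looked up directly and non-string values such as booleans come back with their proper types.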

BeautifulSoup - Select String Based on Dictionary Key

html

python

scrape