Using BeautifulSoup to find h3 when only given substring of its title
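I am trying to collect data from the Jeopardy website. In particular, I want to collect the dollar amounts from the table data on this site:
which appears in lxml like this:
I can do that with the following line of code: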
scores = [int(score.text.replace('$','').replace(',','')) for score in soupEpisode.find('h3', string='Scores at the first commercial break (after clue 15)').findNext('table').find_all('tr')[1].find_all('td')]
However, the table is sometimes displayed slightly differently ("16" instead of "15"), like this:
As a result, this part of my code
soupEpisode.find('h3', string='Scores at the first commercial break (after clue 15)')
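returns None. Is there a way to run the find method using only a substring of the h3 text? If I could write the same line of code using just the substring "Scores at the first commercial break", I believe it would work in every case. Thanks!
Edit:
To test, download the HTML version of this site; the snippet below should then run: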
from bs4 import BeautifulSoup

def main(episode_file="8062.html"):
    # episode_file should be "8062.html"
    episode = open(episode_file, encoding="utf-8")
    soupEpisode = BeautifulSoup(episode, 'lxml')
    episode.close()
    # Exact-match lookup: returns None when the heading says "after clue 16"
    first_commercial_break = [int(score.text.replace('$','').replace(',','')) for score in soupEpisode.find('h3', string='Scores at the first commercial break (after clue 15)').findNext('table').find_all('tr')[1].find_all('td')]
    return first_commercial_break
Try this code. It finds the h3 that contains 'Scores at the first commercial break', then finds the table below that h3.
from bs4 import BeautifulSoup
from urllib.request import urlopen

html_content = urlopen('http://www.j-archive.com/showgame.php?game_id=6432')
soup = BeautifulSoup(html_content, "lxml")
for h3 in soup.find_all('h3'):
    # Substring test on the heading text, so "clue 15" vs "clue 16" does not matter
    if 'Scores at the first commercial break' in h3.text:
        # Keep only the markup that follows the matching h3 and re-parse it
        new_html_content = str(soup).split(str(h3))[1]
        soup = BeautifulSoup(new_html_content, "lxml")
        break
# The first table in the remaining markup is the one below the h3
name_list = [td.text for td in soup.find('table').find('tr').find_all('td')]
dollar_list = [td.text for td in soup.find('table').find_all('tr')[1].find_all('td')]
print(name_list)
print(dollar_list)
The printed result is:
['Kevin', 'Julie', 'Bill']
[',800', '[=11=]', ',200']
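As an alternative sketch that stays closer to the one-liner in the question: BeautifulSoup's `find` also accepts a compiled regular expression for `string=`, which matches when the pattern is found anywhere in the tag's text. The example below assumes the h3 contains only plain text (no nested tags); the `SAMPLE_HTML` markup, its dollar values, and the helper name `first_break_scores` are made up for illustration, and `html.parser` is used only so the snippet runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for a saved episode page.
# The dollar values are invented for this example.
SAMPLE_HTML = """
<h3>Scores at the first commercial break (after clue 16):</h3>
<table>
  <tr><td>Kevin</td><td>Julie</td><td>Bill</td></tr>
  <tr><td>$1,800</td><td>$0</td><td>$2,200</td></tr>
</table>
"""

def first_break_scores(html):
    """Return the scores from the table that follows the matching h3."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex is matched with search(), so a substring is enough.
    heading = soup.find(
        "h3", string=re.compile("Scores at the first commercial break")
    )
    if heading is None:
        return []  # heading not present on this page
    table = heading.find_next("table")
    # Row 0 holds the contestant names, row 1 holds the dollar amounts.
    cells = table.find_all("tr")[1].find_all("td")
    return [
        int(td.get_text(strip=True).replace("$", "").replace(",", ""))
        for td in cells
    ]

print(first_break_scores(SAMPLE_HTML))
```

Returning an empty list when the heading is missing avoids the `AttributeError` that chaining `.find_next` onto `None` would raise.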