How can I sift through various 'a' tags when scraping a website?
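I'm trying to scrape athletic.net, a site that stores track and field times, to get a list of each season for a given athlete, each meet they ran in, and the times they got for each event.
So far I have printed the season titles and the name of each event. I am now trying to sift through the sea of a tags to find the times. I have tried using find_next('a') and find_next_sibling('a'), but I am struggling to isolate the times.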
for text in soup.find_all('h5'):
    #print season titles and event name neatly
    if "Season" in str(text):
        text_file.write(('\n' + '\n' + str(text.contents[0])) + '\n')
    else:
        text_file.write(str(text.contents[0]) + '\n')
    #print all siblings
    for i in range(0,100):
        try:
            text = text.find_next_sibling()
            text_file.write(str(text) + '\n')
        except:
            print("miss")
So far, all I have been able to do is print all of the siblings, which contain all the times. For example:
<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(174827223)" 
ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance & Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>
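This output contains all the times for one event in the athlete's most recent season.
How can I sift through to isolate only the times when there are various a tags that don't contain times?
If I use find_next_sibling('a'), it only prints None.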
The question needs some improvement: it should state the expected output, which is not very clear at the moment.
How can I sift through to isolate only the times when there are various 'a' tags that don't contain times?
You can use css selectors to get all the <a> elements that contain the times:
soup.select('tr [href^="/result"]')
or, more specifically:
soup.select('tr td:nth-of-type(2) [href^="/result"]')
Example
from bs4 import BeautifulSoup
html = '''<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(174827223)" 
ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' && appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance & Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>'''
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(html, 'html.parser')
[t.text for t in soup.select('tr td:nth-of-type(2) [href^="/result"]')]
Output
['2:10.97', '2:05.56', '2:18.54', '2:10.58', '2:13.20']
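A note on why find_next_sibling('a') gave None: the <a> tags sit inside the <table> that follows the heading, so they are descendants of a sibling rather than siblings of the <h5> itself. If the goal is to keep each time together with its date and meet, one possible approach is to walk the rows and pick the links out of each row. The following is only a minimal sketch: the HTML is a shortened, hypothetical copy of two rows from the table above, and the results list and the time_link and meet_link names are made up for illustration. It also assumes the date is always in the third cell, which holds for the sample but may not hold for every page on the site.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample with the same row layout as the table above.
html = '''
<table><tbody>
<tr><td><a href=""><i></i></a><i>9 </i></td>
    <td><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td>
    <td>Mar 4</td>
    <td><a href="meet/443782#53587">Sunset Invitational</a></td>
    <td>O F</td></tr>
<tr><td><a href=""><i></i></a><i>60 </i></td>
    <td><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8"><small>PR</small></a></td>
    <td>Mar 19</td>
    <td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td>
    <td>V F</td></tr>
</tbody></table>
'''

soup = BeautifulSoup(html, 'html.parser')

results = []
for row in soup.select('tr'):
    # The time link is the one whose href starts with "/result";
    # the edit link (href="") and the PR link (href="/post/...") do not match.
    time_link = row.select_one('a[href^="/result"]')
    # The meet link is the one whose href starts with "meet/".
    meet_link = row.select_one('a[href^="meet/"]')
    if time_link is None or meet_link is None:
        continue  # skip rows that do not look like result rows
    cells = row.find_all('td')
    results.append({
        'time': time_link.get_text(strip=True),
        'date': cells[2].get_text(strip=True),  # assumption: date is the third cell
        'meet': meet_link.get_text(strip=True),
    })

for r in results:
    print(r['time'], r['date'], r['meet'])
```

Selecting by href prefix rather than by position among the <a> tags keeps the extraction working when a row has an extra link, such as the PR marker in the second row.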