How can I sift through various 'a' tags when scraping a website?

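I am trying to scrape athletic.net, a site that stores track and field times, to get, for a given athlete, a list of each season, each meet they ran in, and the times they got for each event.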

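So far, I have printed the season titles and the name of each event. I am now trying to sift through the sea of a tags to find the times. I have tried using find_next('a') and find_next_sibling('a'), but I am struggling to isolate the times.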

for text in soup.find_all('h5'):
    # write season titles and event names neatly
    if "Season" in str(text):
        text_file.write(('\n' + '\n' + str(text.contents[0])) + '\n')
    else:
        text_file.write(str(text.contents[0]) + '\n')

        # write up to 100 of the siblings that follow the event heading
        for i in range(100):
            try:
                text = text.find_next_sibling()
                text_file.write(str(text) + '\n')
            except:
                # raised once there are no more siblings (text is None)
                print("miss")

So far, all I have been able to do is print all of the siblings, which contain all of the times. For example:

<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(174827223)" 
ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance &amp; Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>

This output contains all of the athlete's times for one event in their most recent season.

How can I sift through to isolate only the times when there are various a tags that don't contain times?

If I use find_next_sibling('a'), it only prints None.

The question needs some improvement: it should focus on showing the expected output, which is not quite clear.

How can I sift through to isolate only the times when there are various 'a' tags that don't contain times?
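As for why find_next_sibling('a') prints None: in the output shown in the question, the a tags sit inside the table's cells, so they are descendants of the table rather than siblings of the h5 heading. The following is a minimal sketch of that situation. The HTML fragment is a simplified, hypothetical stand-in for the real page, and it assumes the table is a direct sibling of the h5, as the sibling loop in the question suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment: an event heading followed by a results table.
html = '''
<div>
  <h5>800 Meters</h5>
  <table><tbody><tr>
    <td><a href="/result/abc">2:10.97</a></td>
    <td><a href="meet/1">Sunset Invitational</a></td>
  </tr></tbody></table>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
heading = soup.find('h5')

# The <a> tags are nested inside the <table>, so the <h5> has no <a> sibling.
no_sibling = heading.find_next_sibling('a')

# The <table> itself is a sibling, so step to it and search inside it instead.
table = heading.find_next_sibling('table')
times = [a.get_text(strip=True) for a in table.select('a[href^="/result"]')]

print(no_sibling)
print(times)
```

find_next_sibling() only looks at elements on the same level as the starting tag, while select() and find_all() search through everything nested below it.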

You can use css selectors to get all of the <a> tags that hold a time:

soup.select('tr [href^="/result"]')

or, more specifically:

soup.select('tr td:nth-of-type(2) [href^="/result"]')
Example
from bs4 import BeautifulSoup

html = '''<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" 
ng-click="appC.edit(174827223)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance &amp; Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>'''

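# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# keep only the links in the second cell of each row whose href starts with "/result"
print([t.text for t in soup.select('tr td:nth-of-type(2) [href^="/result"]')])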
Output
['2:10.97', '2:05.56', '2:18.54', '2:10.58', '2:13.20']
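If the goal is to keep each time together with its date and meet, the same selectors can be applied one row at a time. This is only a sketch under some assumptions: the HTML below is a shortened copy of two rows from the question, the name records is made up for illustration, and it assumes the date is always the third cell and that meet links always start with "meet/", as they do in the sample.

```python
from bs4 import BeautifulSoup

# Shortened copy of two rows from the table in the question (attributes trimmed).
html = '''<table><tbody>
<tr><td><a href=""></a><i>9 </i></td><td><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td>Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td>O F</td></tr>
<tr><td><a href=""></a><i>60 </i></td><td><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8"><small>PR</small></a></td><td>Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td>V F</td></tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')

records = []  # hypothetical container: one dict per result row
for row in soup.select('tr'):
    time_link = row.select_one('a[href^="/result"]')
    meet_link = row.select_one('a[href^="meet/"]')
    if time_link is None or meet_link is None:
        continue  # skip rows that do not look like a result row
    cells = row.find_all('td')
    records.append({
        'time': time_link.get_text(strip=True),
        'date': cells[2].get_text(strip=True),  # assumes the date is the third cell
        'meet': meet_link.get_text(strip=True),
    })

for record in records:
    print(record)
```

Matching on the href prefix rather than on position means the edit link (empty href) and the "PR" link (/post/...) are ignored without any extra filtering.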