How to get the text from the HTML using Beautiful Soup
I would like to know how to get the text A1 Pawn from this HTML:
<tr id="overview-summary-current">
<th scope="row">
<span class="edit-tools">
<a href="#background-experience" class="edit-section" id="control_gen_4">Edit experience</a>
<script id="controlinit-dust-server-65573249-4" type="text/javascript+initialized" class="li-control">LI.Controls.addControl("control-dust-server-65573249-4","IntraScroller",{tracking:'top-card-edit-experience',paddingTop:-20})</script>
<script type="text/javascript">if(dust&&dust.jsControl){if(!dust.jsControl.flushControlIds){dust.jsControl.flushControlIds="";}else{dust.jsControl.flushControlIds+=",";}dust.jsControl.flushControlIds+="control-dust-server-65573249-4";}</script>
</span>
<a href="#background-experience" data-trk="prof-0-ovw-curr_pos">Current</a>
</th>
<td>
<ol>
<li>
<span data-tracking="mcp_profile_sum" class="new-miniprofile-container /biz/miniprofile/8241336?pathWildcard=8241336" data-li-url="/biz/miniprofile/8241336?pathWildcard=8241336" data-li-getjs="https://static.licdn.com/scds/concat/common/js?h=40vfeoewuurexnhvi1o2qiknu&fc=2" data-li-miniprofile-id="LI-2326069">
<strong>
<a href="/company/8241336?trk=prof-0-ovw-curr_pos" dir="auto">A1 Pawn</a>
</strong>
</span>
</li>
</ol>
</td>
I have tried both a CSS selector and XPath to get the text.
The CSS selector attempt does not work:
str(profilePageSource.find_element_by_css_selector("#overview-summary-current > td > ol > li > span > strong > a").get_text().encode("utf-8"))[2:-1]
The XPath attempt does not work either:
str(profilePageSource.find_element_by_xpath("//*[@id=\"overview-summary-current\"]/td/ol/li/span/strong/a").get_text().encode("utf-8"))[2:-1]
For CSS selectors, you should use the soup.select() method instead of .find_element_by_css_selector. Example -
# select() returns a list of matching tags; the list is empty when nothing matches
elems = profilePageSource.select("#overview-summary-current > td > ol > li > span > strong > a")
if elems:
    print(str(elems[0].get_text().encode("utf-8"))[2:-1])
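To illustrate why the original call fails, here is a minimal sketch. It assumes profilePageSource is a BeautifulSoup object (the question does not show how it was created) and uses a shortened, made-up version of the HTML. As far as I know, Beautiful Soup treats an unknown attribute name as a tag-name lookup, so the Selenium-style method name resolves to None rather than to a method. select_one() is used here as a convenience for "first match or None".

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (an assumption for this sketch).
html = (
    '<table><tr id="overview-summary-current"><td><strong>'
    '<a dir="auto">A1 Pawn</a>'
    '</strong></td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute access with an unknown name is treated like
# soup.find("find_element_by_css_selector"), which finds no such tag.
missing = soup.find_element_by_css_selector

try:
    missing("#overview-summary-current a")
    error = None
except TypeError as exc:
    # Calling None raises TypeError ('NoneType' object is not callable).
    error = str(exc)

# select_one() returns the first matching tag, or None when nothing matches.
link = soup.select_one("#overview-summary-current > td > strong > a")
text = link.get_text() if link is not None else None
print(text)
```

The find_element_by_* names belong to Selenium WebDriver, so they only work on a driver or web element, not on parsed soup.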
Demo -
>>> s = """<tr id="overview-summary-current">
... <th scope="row">
... <span class="edit-tools">
... <a href="#background-experience" class="edit-section" id="control_gen_4">Edit experience</a>
... <script id="controlinit-dust-server-65573249-4" type="text/javascript+initialized" class="li-control">LI.Controls.addControl("control-dust-server-65573249-4","IntraScroller",{tracking:'top-card-edit-experience',paddingTop:-20})</script>
... <script type="text/javascript">if(dust&&dust.jsControl){if(!dust.jsControl.flushControlIds){dust.jsControl.flushControlIds="";}else{dust.jsControl.flushControlIds+=",";}dust.jsControl.flushControlIds+="control-dust-server-65573249-4";}</script>
... </span>
... <a href="#background-experience" data-trk="prof-0-ovw-curr_pos">Current</a>
... </th>
... <td>
... <ol>
... <li>
... <span data-tracking="mcp_profile_sum" class="new-miniprofile-container /biz/miniprofile/8241336?pathWildcard=8241336" data-li-url="/biz/miniprofile/8241336?pathWildcard=8241336" data-li-getjs="https://static.licdn.com/scds/concat/common/js?h=40vfeoewuurexnhvi1o2qiknu&fc=2" data-li-miniprofile-id="LI-2326069">
... <strong>
... <a href="/company/8241336?trk=prof-0-ovw-curr_pos" dir="auto">A1 Pawn</a>
... </strong>
... </span>
... </li>
... </ol>
... </td>"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s,'html.parser')
>>> soup.select("#overview-summary-current > td > ol > li > span > strong > a")
[<a dir="auto" href="/company/8241336?trk=prof-0-ovw-curr_pos">A1 Pawn</a>]
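The demo stops at the list of matching tags, so here is a rough sketch of the last step: turning a match into text. The HTML is a trimmed copy of the snippet above (the span wrapper and the th cell are left out, so the selector is shorter), and the variable names are only illustrative. It also shows that get_text() on a parent tag keeps the surrounding newlines unless strip=True is passed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, one tag per line as in the demo.
html = """<tr id="overview-summary-current">
<td>
<ol>
<li>
<strong>
<a href="/company/8241336?trk=prof-0-ovw-curr_pos" dir="auto">A1 Pawn</a>
</strong>
</li>
</ol>
</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list, so guard against an empty result before indexing.
matches = soup.select("#overview-summary-current > td > ol > li > strong > a")
company = matches[0].get_text() if matches else None

# On a parent tag, get_text() includes the newlines around the link.
strong = soup.select_one("#overview-summary-current strong")
raw = strong.get_text()
clean = strong.get_text(strip=True)

print(repr(company), repr(raw), repr(clean))
```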
soup.find(id='overview-summary-current').td.a.text
should give you the result.
You can also get the result with:
soup.find('a', {'dir': "auto"}).text
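For comparison, a small sketch that runs both lookups on a reduced version of the HTML and guards against find() returning None. The reduced HTML and the variable names are assumptions made for illustration. It also shows why the .td step matters: without it, the first link in the row is the "Edit experience" link in the header cell.

```python
from bs4 import BeautifulSoup

# Reduced version of the HTML from the question (scripts and tracking attributes removed).
html = """<tr id="overview-summary-current">
<th scope="row">
<a href="#background-experience" class="edit-section">Edit experience</a>
<a href="#background-experience">Current</a>
</th>
<td>
<ol><li><strong>
<a href="/company/8241336?trk=prof-0-ovw-curr_pos" dir="auto">A1 Pawn</a>
</strong></li></ol>
</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find(id="overview-summary-current")

# Navigation by tag name: row -> first <td> -> first <a> inside it.
by_navigation = None
if row is not None and row.td is not None and row.td.a is not None:
    by_navigation = row.td.a.text

# Lookup by attribute: the first <a> anywhere with dir="auto".
link = soup.find("a", {"dir": "auto"})
by_attribute = link.text if link is not None else None

# Skipping .td picks up the first link in the header cell instead.
first_link_in_row = row.a.text

print(by_navigation, by_attribute, first_link_in_row)
```

The attribute lookup is shorter, but it depends on no earlier link on the page carrying dir="auto"; scoping the search to the row is the safer choice on a full page.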