BeautifulSoup parsing - dealing with superscript?
Here is the HTML snippet I want to extract information from:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1"><span id="yfs_j10_aal">33.57B</span></td></tr>
How it looks on the web page:
Market Cap (intraday)5:33.57B
What I have (doesn't work):
HTML_MarketCap = soup.find('sup', text='5').find_next_sibling('span').text
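How do I extract the 33.57B string?
The span isn't a sibling of the <sup>; it is a child of the sibling of the <sup>'s grandparent, i.e. a first cousin once removed (thanks, 1.618).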
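from bs4 import BeautifulSoup as bs
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs("""<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)
<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1">
<span id="yfs_j10_aal">33.57B</span></td></tr>""", "html.parser")
# sup -> font -> td, then over to the next td and down to its span
soup.find("sup", text="5").parent.parent.find_next_sibling("td").find("span").text
# '33.57B'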
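To see why the original find_next_sibling('span') attempt fails, here is a minimal sketch that checks each step separately. It assumes Python 3, a reasonably recent bs4 (where `string=` is the newer name for the `text=` argument), and the built-in "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)'
    '<font size="-1"><sup>5</sup></font>:</td>'
    '<td class="yfnc_tabledata1"><span id="yfs_j10_aal">33.57B</span></td></tr>'
)

soup = BeautifulSoup(html, "html.parser")
sup = soup.find("sup", string="5")

# <sup> is the only child of <font>, so it has no siblings at all.
# find_next_sibling() therefore returns None, and calling .text on
# that None is what raises AttributeError in the question's code.
no_sibling = sup.find_next_sibling("span")

# Walk up two levels instead: sup -> font -> td (the label cell).
label_cell = sup.parent.parent

# The value cell is the next <td> sibling of the label cell,
# and the span we want is inside it.
value_cell = label_cell.find_next_sibling("td")
market_cap = value_cell.find("span").get_text()

print(no_sibling)
print(market_cap)
```

Checking for None after each lookup, rather than chaining everything in one expression, makes it easier to tell which step broke if the page layout changes.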
Since you seem to be having trouble, here is my full test script (using python-requests), which works reliably for me:
import requests
from bs4 import BeautifulSoup as bs
url = "https://finance.yahoo.com/q/ks?s=AAL+Key+Statistics"
r = requests.get(url)
soup = bs(r.text, "html.parser")
HTML_MarketCap = soup.find("sup", text="5").parent.parent.find_next_sibling("td").find("span").text
print(HTML_MarketCap)
Alternatively, you can simply use find_next() after locating the <sup>5</sup> element, like this:
from bs4 import BeautifulSoup
s = '''<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1"><span id="yfs_j10_aal">33.57B</span></td></tr>'''
soup = BeautifulSoup(s, "html.parser")
sup = soup.find('sup', text='5')
sup.find_next('span')
Out[5]: <span id="yfs_j10_aal">33.57B</span>
sup.find_next('span').text
Out[6]: '33.57B'
>>> help(sup.find_next)
Help on method find_next in module bs4.element:

find_next(self, name=None, attrs={}, text=None, **kwargs) method of bs4.element.Tag instance
    Returns the first item that matches the given criteria and
    appears after this Tag in the document.
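Because find_next() searches everything that follows the tag in document order, it can be wrapped so that a missing footnote or span yields None instead of an AttributeError. The following is only a sketch under the same assumptions as the snippet in the question; `value_after_footnote` is a hypothetical helper name, not part of bs4, and `string=` is the newer spelling of the `text=` argument.

```python
from bs4 import BeautifulSoup


def value_after_footnote(soup, footnote):
    """Hypothetical helper: text of the first <span> that follows
    <sup>footnote</sup> in document order, or None if either is missing."""
    sup = soup.find("sup", string=footnote)
    if sup is None:
        return None
    # find_next() looks at every element after this tag, not only siblings.
    span = sup.find_next("span")
    if span is None:
        return None
    return span.get_text(strip=True)


html = (
    '<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)'
    '<font size="-1"><sup>5</sup></font>:</td>'
    '<td class="yfnc_tabledata1"><span id="yfs_j10_aal">33.57B</span></td></tr>'
)
soup = BeautifulSoup(html, "html.parser")

result = value_after_footnote(soup, "5")
missing = value_after_footnote(soup, "9")  # no such footnote in the snippet
print(result, missing)
```

Note that on a full page find_next('span') returns the first span anywhere after the <sup>, so this relies on the value cell being the next place a span appears.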