BeautifulSoup4: extracting data from HTML5 data-* attributes
I want to extract only the inner text 24,000.00 from the following tag:
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
The page I am extracting data from contains many similar tags.
This is what I am trying:
for price in soup.find_all('span', {'class': 'itm-price'}):
    item_price = price.get('data-price')
    print(item_price)
But the output is: None
I learned from the bs4 docs that for HTML5 data-* attributes we should use:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Since I am a beginner, I still could not get a result with that method.
Use the find method:
>>> from bs4 import BeautifulSoup
>>> url = """<span class="itm-price mrs "><span data-currency-iso="BDT">৳</span><span dir="ltr" data-price="24000">24,000.00</span></span>"""
>>> soup = BeautifulSoup(url, "html.parser")
>>> soup.find("span", dir="ltr").string
'24,000.00'
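The None in the question comes from calling .get('data-price') on the outer span, which has no such attribute; the attribute sits on the nested span. Below is a minimal sketch, assuming the markup from the question (the variable names html, outer and prices are illustrative). It matches on the presence of data-price rather than on a specific value:

```python
from bs4 import BeautifulSoup

html = """
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# The outer span carries only a class attribute, so .get() returns None here.
outer = soup.find("span", class_="itm-price")
print(outer.get("data-price"))

# attrs={"data-price": True} matches any tag that has the attribute at all,
# whatever its value is.
prices = []
for tag in soup.find_all(attrs={"data-price": True}):
    prices.append((tag["data-price"], tag.get_text(strip=True)))
print(prices)
```

Each tuple holds the raw attribute value and the visible text, so either one can be used depending on whether the formatted or unformatted price is wanted.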
You can try this:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <span class="itm-price mrs ">
... <span data-currency-iso="BDT">৳</span>
... <span dir="ltr" data-price="24000">24,000.00</span>
... </span>
... <span class="itm-price mrs ">
... <span data-currency-iso="BDT">৳</span>
... <span dir="ltr" data-price="25000">25,000.00</span>
... </span>
... <span class="itm-price mrs ">
... <span data-currency-iso="BDT">৳</span>
... <span dir="ltr" data-price="blabla">blabla</span>
... </span>
... """
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> soup.find("span", dir="ltr").attrs['data-price']
'24000'
>>> # You can also loop over every matching span
>>> for price_span in soup.find_all("span", attrs={"dir": "ltr", "data-price": re.compile(r"\d+")}):
...     print(price_span.attrs.get("data-price", None))
...
24000
25000
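An alternative to the regex filter is a CSS selector through select(), which in recent bs4 releases relies on the soupsieve package (normally installed alongside bs4). The sketch below assumes a shortened version of the markup above. The str.isdigit() check is my own assumption about what counts as a valid price. It is stricter than re.compile(r"\d+"), which Beautiful Soup applies with search() and which therefore accepts any value containing a digit.

```python
from bs4 import BeautifulSoup

html_doc = """
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="blabla">blabla</span>
</span>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Descendant spans of .itm-price that carry a data-price attribute.
spans = soup.select("span.itm-price span[data-price]")

# Keep only values made up entirely of digits (an assumption about the data).
numeric = [s["data-price"] for s in spans if s["data-price"].isdigit()]
print(numeric)
```

The selector keeps the lookup scoped to price containers, which may matter if other parts of the page also use dir="ltr".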
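Why look for the enclosing <span> when you can access the inner one directly? Also, you can use keyword arguments (although I understand why you would not try that with the class attribute, since class is a Python keyword).
The get_text() method extracts the content between a matching pair of tags, so you end up with a very simple program: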
# coding=utf-8
data = u"""\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""
import bs4
soup = bs4.BeautifulSoup(data, "html.parser")  # name the parser explicitly
for price in soup.find_all('span', dir="ltr"):
    print(price.get_text())
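If the extracted text is needed as a number rather than a string, it can be converted after get_text(). This is only a sketch: parse_price is a hypothetical helper, not part of bs4, and it assumes prices always use a comma as the thousands separator and a dot as the decimal mark.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

data = """\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

soup = BeautifulSoup(data, "html.parser")


def parse_price(text):
    """Turn a string like '24,000.00' into a Decimal (hypothetical helper)."""
    return Decimal(text.strip().replace(",", ""))


# Collect one Decimal per span that has dir="ltr".
amounts = [parse_price(span.get_text()) for span in soup.find_all("span", dir="ltr")]
print(amounts)
```

Decimal is used instead of float so that currency amounts are not subject to binary rounding.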