How to scrape DIVs whose class or ID contains specific text
I have some HTML scraped from a website:
<div>
<div id="content1">
</div>
<div id="content3">
</div>
<div id="content22">
</div>
</div>
How do I loop over all the DIVs whose ID starts with content?
The easiest way is to use a CSS selector:
soup.select('div[id^=content]')
The ^= syntax specifies that the value of the id attribute must start with content.
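Since the title also asks about values that merely contain some text, and about classes as well as IDs, here is a small sketch of the related attribute selectors. The markup and the names in it are made up for illustration and are not from the question. The *= operator is the standard CSS substring match; the sketch assumes a reasonably recent Beautiful Soup, where select() is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only (not from the question).
html = '''
<div>
<div id="content1" class="box main-content"></div>
<div id="sidebar" class="box"></div>
<div id="my-content-2" class="footer"></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# id starts with "content"
starts = [d['id'] for d in soup.select('div[id^="content"]')]

# id contains "content" anywhere
contains = [d['id'] for d in soup.select('div[id*="content"]')]

# class attribute contains "content" anywhere
class_contains = [d['id'] for d in soup.select('div[class*="content"]')]

print(starts, contains, class_contains)
```

Quoting the value inside the selector is optional for simple words like content, but it is needed as soon as the text contains spaces or punctuation.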
You can get the same result with a regular expression filter passed in as the id argument to element.find_all():
import re
soup.find_all('div', id=re.compile('^content'))
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div>
... <div id="content1">
... </div>
... <div id="content3">
... </div>
... <div id="content22">
... </div>
... </div>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> soup.select('div[id^=content]')
[<div id="content1">
</div>, <div id="content3">
</div>, <div id="content22">
</div>]
>>> soup.find_all('div', id=re.compile('^content'))
[<div id="content1">
</div>, <div id="content3">
</div>, <div id="content22">
</div>]
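To actually loop over the matches, either result list can be iterated directly. The following is a minimal sketch that reuses the sample from the question; the variable names ids and anywhere are only illustrative. It also shows that Beautiful Soup applies a compiled pattern with search semantics, so the ^ anchor is what limits the match to the start of the value.

```python
import re
from bs4 import BeautifulSoup

sample = '''\
<div>
<div id="content1">
</div>
<div id="content3">
</div>
<div id="content22">
</div>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')

# Loop over every div whose id starts with "content".
ids = []
for div in soup.find_all('div', id=re.compile('^content')):
    ids.append(div['id'])
    print(div['id'])

# Without the ^ anchor the pattern may match anywhere in the id.
anywhere = [d['id'] for d in soup.find_all('div', id=re.compile('tent2'))]
print(anywhere)
```

The same pattern can be passed as class_=re.compile(...) to filter on class names instead of IDs.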