BeautifulSoup: filtering descendants using regex

Question

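I am trying to use Beautiful Soup to return elements in the DOM that contain children matching a filter criterion.

In the example below, I want to return both divs, based on a regex match found in a child element.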

<body>
<div class="randomclass1">
    <span class="randomclass">regexmatch1</span>
    <h2>title</h2>
</div>
<div class="randomclass2">
    <span class="randomclass">regexmatch2</span>
    <h2>title</h2>
</div>
</body>

The basic code setup is as follows:

from bs4 import BeautifulSoup as soup
page = soup(html, 'html.parser')  # html holds the markup above; the parser is named explicitly
Results = page.find_all('div')

How can I add a regex test that evaluates the children of the target div? That is, how do I add the regex call below to Beautiful Soup's 'find' or 'find_all' functions?

re.compile(r'regexmatch\d')

Answer 1

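The approach I took was find_parent, which returns the parent element of a Beautiful Soup result, regardless of the method (regex or otherwise) used to find the original result. For the example above: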

import re
childOfResults = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = childOfResults[0].find_parent()  # parent of the first matching span

...then modify this with a loop of your choice to iterate over all members of childOfResults.
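The loop mentioned above could look something like the sketch below. It is only one possible way to do it: the names pattern, matching_spans, and parent_divs are illustrative, and it assumes the parent of interest is the nearest enclosing div, so it calls find_parent('div') rather than the bare find_parent().

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="randomclass1">
    <span class="randomclass">regexmatch1</span>
    <h2>title</h2>
</div>
<div class="randomclass2">
    <span class="randomclass">regexmatch2</span>
    <h2>title</h2>
</div>
</body>"""

page = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"regexmatch\d")

# Find every span whose text matches the regex.
matching_spans = page.find_all("span", string=pattern)

# Walk up from each span to its enclosing div. Track object identity so a
# div containing several matching spans is only collected once.
parent_divs = []
seen_ids = set()
for span in matching_spans:
    parent = span.find_parent("div")
    if parent is not None and id(parent) not in seen_ids:
        seen_ids.add(id(parent))
        parent_divs.append(parent)

for div in parent_divs:
    print(div["class"])
```

Deduplicating by id() rather than with "in" is a deliberate choice here, since Beautiful Soup compares tags by content and two distinct divs with identical markup would otherwise be treated as the same element.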

Answer 2

Select the div elements first, then run a for loop over all of them.

Example

from bs4 import BeautifulSoup

html = """<body>
           <div class="randomclass1">
               <span class="randomclass">regexmatch1</span>
               <h2>title</h2>
           </div>
           <div class="randomclass2">
               <span class="randomclass">regexmatch2</span>
               <h2>title</h2>
           </div>
       </body>"""

page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')

for element in elements:
    print(element.select("span:nth-child(1)")[0].text)

This prints:

regexmatch1
regexmatch2
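Note that the loop above prints the first span of every div without applying the regex. A minimal sketch of how the same loop might be combined with the regex from the question follows. The third div and the names pattern and matching_divs are assumptions added for illustration, to show a non-matching div being filtered out.

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="randomclass1">
    <span class="randomclass">regexmatch1</span>
    <h2>title</h2>
</div>
<div class="randomclass2">
    <span class="randomclass">regexmatch2</span>
    <h2>title</h2>
</div>
<div class="randomclass3">
    <span class="randomclass">no match here</span>
</div>
</body>"""

page_soup = BeautifulSoup(html, features="html.parser")
pattern = re.compile(r"regexmatch\d")

# Keep only the divs that contain a span whose text matches the regex.
matching_divs = [
    div
    for div in page_soup.select("body > div")
    if div.find("span", string=pattern) is not None
]

for div in matching_divs:
    print(div["class"])
```

Here div.find searches all descendants of the div, not only its direct children, so a matching span nested more deeply would also count.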
