Get a specific section from an HTML document

Question

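Hello, I want to get a specific section from an HTML document. The section is associated with a div and is wrapped in a span tag, and it usually appears in the file.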

self.contents = BeautifulSoup(convert_pdf_to_html(self.path), 'html.parser')
self.keywords = self.contents.find('span',text=re.compile("(.*keywords.*|.*key-words.*)",re.IGNORECASE)).parent

The problem is that the span always contains a line break, which prevents me from retrieving the relevant div, for example:

<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">keywords
<br/></span>

Even a regular expression does not work, and there is no option to strip the text.

Answer 1

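First, let me tell you that your regular expression is slightly off: I would escape the - as \- (strictly speaking, a hyphen only needs escaping inside a character class, but escaping it here is harmless).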
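To check the point about the hyphen, here is a small sketch that compares the escaped and unescaped patterns using only the standard `re` module. The sample strings are made up for illustration; the idea is just to see whether both patterns accept and reject the same inputs.

```python
import re

# Two spellings of the same pattern: with and without the escaped hyphen.
escaped = re.compile(r"key\-?words", re.IGNORECASE)
plain = re.compile(r"key-?words", re.IGNORECASE)

# Illustrative inputs (assumed, not taken from the real document).
samples = ["keywords\n", "Key-Words", "KEYWORDS:", "key words"]

for s in samples:
    # search() scans the whole string, so no leading/trailing .* is needed.
    print(repr(s), bool(escaped.search(s)), bool(plain.search(s)))
```

Because `search()` looks for the pattern anywhere in the string, the `.*` wrappers and the alternation in the original expression should not be needed either.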

Anyway, something like this works for me, although recently I too have had trouble combining regular expressions with find.

import re
from bs4 import BeautifulSoup as bs
contents = bs(open(path), 'html.parser')
keywords = contents.find(text = re.compile(r"key\-?words",re.I|re.U)).parent
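A likely reason the version without a tag name works: when a tag name is combined with a text filter, Beautiful Soup compares the pattern against the tag's `.string`, and `.string` is `None` when the tag has more than one child (here, the text plus `<br/>`). Searching the text nodes directly avoids that. The following is a minimal sketch under that assumption; the surrounding `<div id="kw">` and the second span are hypothetical markup added for illustration, and `string=` is the newer name for the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the span from the question;
# the wrapping <div id="kw"> and the second <span> are assumptions.
html = """
<div id="kw">
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">keywords
<br/></span>
<span>alpha; beta</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"key-?words", re.IGNORECASE)

# No tag name here: the pattern is matched against each text node,
# so the <br/> inside the span does not get in the way.
text_node = soup.find(string=pattern)

span = div = None
if text_node is not None:
    span = text_node.parent          # the <span> holding the matched text
    div = span.find_parent("div")    # nearest enclosing <div>, if any

if div is not None:
    print(span.name, div.get("id"))
    print(div.get_text(" ", strip=True))
```

The `None` checks matter in practice, since `find` returns `None` when nothing matches and calling `.parent` on it would raise an `AttributeError`.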

python

beautifulsoup