Parse XML using regular expressions

Question

I want to parse some tags.

The pattern is

<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>

I thought this would work:

re.findall(">"."</a></div>")

But it doesn't.

What's wrong with it?

------------ Update 1 ------------

Now I know this approach does not work well for HTML.

raj gave me an answer:

>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.select('div > a:first-child')[0].text
'What_I_Want'

I have one more question. How can I find every

<div id blah blah </div>

in the whole file?

Answer 1

Short answer: you can't.

A different short answer: use a Python XML parser (it even has examples).
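
To make the parser route concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree on the snippet from the question. It assumes the input is well-formed XML, which real-world HTML often is not; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'

# fromstring() parses the text and returns the root element, here the <div>.
root = ET.fromstring(s)

# find('a') only looks at direct children of the element it is called on.
link = root.find('a')

print(link.text)         # What_I_Want
print(link.get('href'))  # http://url/tag
```

If the markup is not well-formed, ET.fromstring raises a ParseError, and an HTML-aware parser is the safer choice.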

Answer 2

It seems you're trying to get the text of the a tag that is a direct child of the parent div tag.

>>> from bs4 import BeautifulSoup
>>> s = '<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.select('div > a:first-child')[0].text
'What_I_Want'
>>> soup.select('div > a')[0].text
'What_I_Want'
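
For the follow-up about finding every div that carries an id in a whole file, one possible sketch is below. The sample document, the second and third div, and the variable names are made up for illustration; in practice you would read the file contents into html instead. Passing id=True to find_all matches any tag that has an id attribute, whatever its value.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of the whole file.
html = '''
<html><body>
<div id="tags">blah-blah<a href="http://url/tag">What_I_Want</a></div>
<div class="other">no id here</div>
<div id="more">blah<a href="http://url/other">Another</a></div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# id=True keeps only the <div> tags that have an id attribute.
divs = soup.find_all('div', id=True)

ids = [d['id'] for d in divs]
# d.a is the first <a> found inside each matching div.
link_texts = [d.a.text for d in divs]

print(ids)         # ['tags', 'more']
print(link_texts)  # ['What_I_Want', 'Another']
```

If you only care about one specific id, soup.find_all('div', id='tags') narrows the search to that value.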

python

parsing

beautifulsoup