How to parse HTML using ElementTree to find a particular RegEx?
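I am using Python 2.7.6 with ElementTree to load and parse an HTML file from the file system, and then I want to walk through the file and store the strings that match a particular RegEx in a data structure.
In my project folder I have an HTML file named person.html: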
<!DOCTYPE html>
<html>
<body>
<ul>
<li>Name: $name</li>
<li>Age: $age</li>
</ul>
</body>
</html>
Here is my Python script (main.py) so far:
#!/usr/bin/env python
import web
import xml.etree.ElementTree as ET
tree = ET.parse('person.html')
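Questions:
How do I use a RegEx or ElementTree to find the values that start with $ (e.g. $name and $age)?
How do I store these values in a data structure that I can iterate through later?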
How about using a RegEx like this:
>>> html = """
... <!DOCTYPE html>
... <html>
... <body>
... <ul>
... <li>Name: $name</li>
... <li>Age: $age</li>
... </ul>
... </body>
... </html>
... """
>>> import re
>>> re.findall(r'\$\w*', html)
['$name', '$age']
>>>
re.findall() returns a list, so you can use the results like this:
>>> l = re.findall(r'\$\w*', html)
>>> l
['$name', '$age']
>>> l[0]
'$name'
>>> l[1]
'$age'
>>>
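The same search can be wrapped in a small reusable function. This is only a sketch: the name find_placeholders and the sample string are made up for illustration, and the pattern uses \w+ rather than \w* on the assumption that a bare $ with nothing after it should not count as a variable.

```python
import re

# Hypothetical helper, not part of the original answer.
# \$ matches a literal dollar sign; \w+ requires at least one word character after it.
PLACEHOLDER = re.compile(r'\$\w+')

def find_placeholders(text):
    """Return every $-prefixed word in text, in order of appearance."""
    return PLACEHOLDER.findall(text)

# Made-up sample: the last item contains a bare "$" that should be ignored.
html = "<li>Name: $name</li>\n<li>Age: $age</li>\n<li>Price: $ 5</li>"
print(find_placeholders(html))
```

Compiling the pattern once keeps the function cheap to call on many files.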
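lxml is meant for searching HTML by tag, and the standard library's xml.etree.ElementTree used here works the same way. For example, if you want to find all the <li> tags and get their text: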
import xml.etree.ElementTree as et
tree = et.parse('data.html')
html_tag = tree.getroot()
for li in html_tag.iter('li'):
    text = li.text
    print(text)
--output:--
Name: $name
Age: $age
If your target text can be in any tag, then you can do this:
import xml.etree.ElementTree as et
import re
tree = et.parse('data.html')
html_tag = tree.getroot()
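pattern = r"""
    \$      # a literal $ sign (an unescaped $ would match end-of-string)
    .+?     # one or more characters, non-greedy
    \b      # up to the first word boundary
"""
for tag in html_tag.iter('*'):  # '*' => all tags
    text = (tag.text or '').strip()  # tag.text is None for empty elements
    if text:
        match_list = re.findall(pattern, text, flags=re.X)
        print(match_list)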
--output:--
['$name']
['$age']
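One caveat with reading only tag.text: text that follows a child element is stored in that child's tail attribute, so a variable placed there would be missed. A minimal sketch of one way around this, using Element.itertext(), follows. The markup and the function name placeholders_in_tree are assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed sample markup: "$age" comes after the <b> child,
# so it lives in that element's .tail rather than in any .text.
markup = "<ul><li>Name: $name</li><li><b>Age</b>: $age</li></ul>"
root = ET.fromstring(markup)

def placeholders_in_tree(element):
    """Collect $-prefixed words from all text and tail chunks below element."""
    found = []
    # itertext() yields each inner text chunk (text and tails) in document order.
    for chunk in element.itertext():
        found.extend(re.findall(r'\$\w+', chunk))
    return found

print(placeholders_in_tree(root))
```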
How do I store these values into a data structure that I could iterate
through in the future?
You can use the shelve module:
$ cat data.html
<!DOCTYPE html>
<html>
<body>
<ul>
<li>Name: $name</li>
<li>Age: $age</li>
<li>Dogs: $dog1, $dog2</li>
</ul>
</body>
</html>
import xml.etree.ElementTree as et
import re
import shelve
import collections as coll
tree = et.parse('data.html')
html_tag = tree.getroot()
pattern = r"""
    \$      # Match a literal $ sign...
    .+?     # followed by any character, 1 or more times, non-greedy
    \b      # followed by the (first) word boundary
"""
results = coll.defaultdict(list)
for tag in html_tag.iter('*'):
    text = (tag.text or '').strip()  # tag.text is None for empty elements
    if text:
        match_list = re.findall(pattern, text, flags=re.X)
        if match_list:
            results['data.html'].extend(match_list)
print(results)
# Store the results on disk...
with shelve.open('mydb.db') as db:
    db['html vars'] = results
# ...and read them back later.
with shelve.open('mydb.db') as db:
    for key, val in db['html vars'].items():
        print("{}: {}".format(key, val))
--output:--
defaultdict(<class 'list'>, {'data.html': ['$name', '$age', '$dog1', '$dog2']})
data.html: ['$name', '$age', '$dog1', '$dog2']
If your ultimate goal is to substitute values for those variables in the HTML, your format already matches Python's string.Template format:
import string
with open('data.html') as f:
    template = string.Template(f.read())
values = {
    'name': 'socal_javaguy',
    'age': 25,
    'dog1': 'Rover',
    'dog2': 'Jane',
}
results = template.substitute(values)
print(results)
--output:--
<!DOCTYPE html>
<html>
<body>
<ul>
<li>Name: socal_javaguy</li>
<li>Age: 25</li>
<li>Dogs: Rover, Jane</li>
</ul>
</body>
</html>
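Note that substitute() raises a KeyError if any variable in the template has no value. When some values may be missing, string.Template also provides safe_substitute(), which leaves unknown variables in place. A short sketch with a made-up two-item template:

```python
import string

# Made-up template for illustration; only 'name' is supplied below.
template = string.Template("<li>Name: $name</li><li>Age: $age</li>")
values = {'name': 'socal_javaguy'}

# substitute() insists on a value for every variable.
try:
    template.substitute(values)
    missing = None
except KeyError as err:
    missing = err.args[0]  # the name of the first variable without a value

# safe_substitute() fills in what it can and keeps the rest as-is.
partial = template.safe_substitute(values)

print(missing)
print(partial)
```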
Thanks to Kevin and 7stud, this is how I got it working:
#!/usr/bin/env python
import re
with open("person.html", "r") as html_file:
    data = html_file.read()
# \$ matches a literal dollar sign; an unescaped $ would anchor at end-of-string.
list_of_strings = re.findall(r'\$[A-Za-z]+[A-Za-z0-9]*', data)
print(list_of_strings)
Output:
['$name', '$age']
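If the variable names are going to be used as dictionary keys later (for example with string.Template), it may be handier to store them without the $ and without repeats. A possible sketch, where unique_names and the sample string are hypothetical:

```python
import re

def unique_names(data):
    """Return variable names (without the $) in first-seen order, no repeats."""
    seen = []
    # The capture group makes findall() return only the part after the $.
    for name in re.findall(r'\$([A-Za-z][A-Za-z0-9]*)', data):
        if name not in seen:
            seen.append(name)
    return seen

# Made-up sample in which $name appears twice.
sample = "<li>Name: $name</li><li>Age: $age</li><li>Again: $name</li>"
print(unique_names(sample))
```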