Extracting text with parent tag type from HTML using Python
I want to extract text together with its element type from some HTML. For example:
<div>
some text
<h1>some header</h1>
some more text
</div>
should give:
[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]
How can I parse the HTML to extract this information?
I tried BeautifulSoup and was able to extract one level of information from the HTML, like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features='html.parser')
for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)
This gives the following output:
div
None
text here
h1
some header
None
more text here
Using lxml and recursion, I can do this:
text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''
import lxml.html

def display(item):
    print('item:', item)
    print('tag :', item.tag)
    # .text and .tail are None when there is no text, so fall back to ''
    print('text:', (item.text or '').strip())
    tail = (item.tail or '').strip()
    if tail:
        print('tail:', tail, '| parent:', item.getparent().tag)
    print('---')
    for child in item:  # iterating an element yields its children
        display(child)

soup = lxml.html.fromstring(text)
display(soup)
This gives:
item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---
It treats "some more text" as the tail of h1, but you can use getparent() to assign it to the div.
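To make the text/tail split concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that the snippet is well-formed XML; lxml follows the same ElementTree model for .text and .tail, so the attribute values should carry over.

```python
import xml.etree.ElementTree as ET

# Assumption: the snippet is well-formed, so the stdlib XML parser accepts it.
div = ET.fromstring("<div>some text<h1>some header</h1>some more text</div>")
h1 = div[0]

print(repr(div.text))  # text before the first child of div
print(repr(h1.text))   # text inside h1
print(repr(h1.tail))   # text after </h1>, which still sits inside div
print(repr(div.tail))  # None, because nothing follows the root element
```

The tail is stored on the child, which is why the parent has to be looked up to label it correctly.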
After a small modification:
text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''
import lxml.html

results = []

def convert(item):
    # .text and .tail are None when there is no text, so fall back to ''
    results.append({'tag': item.tag, 'text': (item.text or '').strip()})
    tail = (item.tail or '').strip()
    if tail:
        # Tail text follows the closing tag, so it belongs to the parent
        results.append({'tag': item.getparent().tag, 'text': tail})
    for child in item:
        convert(child)

soup = lxml.html.fromstring(text)
convert(soup)
print(results)
This gives the result:
[
{'tag': 'div', 'text': 'some text'},
{'tag': 'h1', 'text': 'some header'},
{'tag': 'div', 'text': 'some more text'}
]
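One possible variation, sketched below, avoids the global list and emits each tail right after the child's subtree, so rows stay in document order even when elements are nested. The function name walk is hypothetical. It only uses .tag, .text, .tail and iteration, so it should work on lxml elements too; the example runs on the standard library's xml.etree.ElementTree to stay dependency-free, assuming well-formed input.

```python
import xml.etree.ElementTree as ET


def walk(elem):
    """Yield {'tag', 'text'} dicts in document order (hypothetical helper)."""
    text = (elem.text or "").strip()
    if text:
        yield {"tag": elem.tag, "text": text}
    for child in elem:
        yield from walk(child)
        # The child's tail comes after the child but belongs to elem.
        tail = (child.tail or "").strip()
        if tail:
            yield {"tag": elem.tag, "text": tail}


root = ET.fromstring(
    "<div>some text<h1>some <b>bold</b> header</h1>some more text</div>"
)
for row in walk(root):
    print(row)
```

Because the parent is known from the recursion itself, getparent() is not needed here.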
I have now also managed to get it working with BeautifulSoup:
from bs4 import BeautifulSoup

def sanitize(element):
    # Collapse newlines and runs of spaces into single spaces
    element = element.replace('\n', ' ')
    while '  ' in element:  # two consecutive spaces
        element = element.replace('  ', ' ')
    return element.strip()

def parse(soup):
    for content in soup.contents:
        if not content.name:
            # Text node: report it with the tag that directly contains it
            text = sanitize(str(content))
            if text:
                yield text, soup.name
        else:
            # Tag: recurse so nested elements are handled at any depth
            yield from parse(content)

html = """
<div>
text here
<h1>some header</h1>
more text here
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
print(list(parse(soup)))
This gives:
[('text here', 'div'), ('some header', 'h1'), ('more text here', 'div')]
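For comparison, the same idea can be sketched with only the standard library's html.parser.HTMLParser by keeping a stack of open tags. This is a different technique from the BeautifulSoup code above, not a drop-in replacement. The names VOID_TAGS, TextWithParent and extract are hypothetical, and text outside any tag is simply skipped.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextWithParent(HTMLParser):
    """Collect each non-empty text run together with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace, like sanitize()
        if text and self.stack:
            self.results.append({"tag": self.stack[-1], "text": text})


def extract(html):
    parser = TextWithParent()
    parser.feed(html)
    parser.close()
    return parser.results


html = """
<div>
text here
<h1>some header</h1>
more text here
</div>
"""
print(extract(html))
```

This produces dicts in the format asked for in the question, at the cost of handling malformed markup less gracefully than BeautifulSoup or lxml.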