Extracting text with parent tag type from HTML using Python

I want to extract the text from some HTML together with the type of the element that directly contains it. For example:

<div>
    some text
    <h1>some header</h1>
    some more text
</div>

should give:

[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]

How can I parse the HTML to extract this information?

I tried BeautifulSoup and was able to extract one level of information from the HTML, like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features='html.parser')

for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)

which gives the following output:

div
None
   text here

h1
some header
None
  more text here

Using lxml and recursion, I can do this:

import lxml.html

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

def display(item):
    print('item:', item)
    print('tag :', item.tag)
    # .text and .tail are None when there is no text, so fall back to ''
    print('text:', (item.text or '').strip())
    tail = (item.tail or '').strip()
    if tail:
        print('tail:', tail, '| parent:', item.getparent().tag)

    print('---')

    # iterating over an element yields its children (getchildren() is deprecated)
    for child in item:
        display(child)

soup = lxml.html.fromstring(text)

display(soup)

This gives:

item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---

lxml treats `some more text` as the tail of `h1`, because it follows the closing `</h1>` tag, but you can use getparent() to assign it to the `div`.


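After a small modification:

import lxml.html

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

results = []

def convert(item):
    # .text is None when the element has no leading text
    head = (item.text or '').strip()
    if head:
        results.append({'tag': item.tag, 'text': head})

    # visit the children before the tail so results stay in document order
    for child in item:
        convert(child)

    # the tail follows the closing tag, so it belongs to the parent
    tail = (item.tail or '').strip()
    if tail:
        results.append({'tag': item.getparent().tag, 'text': tail})

soup = lxml.html.fromstring(text)

convert(soup)

print(results)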

This gives the result:

[
   {'tag': 'div', 'text': 'some text'}, 
   {'tag': 'h1', 'text': 'some header'}, 
   {'tag': 'div', 'text': 'some more text'}
]
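
The text/tail model used above is not specific to lxml: the standard library's xml.etree.ElementTree exposes the same `.text` and `.tail` attributes. The sketch below is only an illustration of that model, not a replacement for the lxml code. It assumes the input is well-formed XML (real-world HTML often is not), and because ElementTree has no getparent(), the hypothetical helper `walk` passes the parent down explicitly.

```python
import xml.etree.ElementTree as ET

# Well-formed snippet, so the stdlib XML parser can read it.
snippet = "<div>\n    some text\n    <h1>some header</h1>\n    some more text\n</div>"

def walk(element, parent=None, out=None):
    """Collect (tag, text) pairs in document order, crediting tail text to the parent."""
    if out is None:
        out = []
    # Text before the first child belongs to the element itself.
    text = (element.text or "").strip()
    if text:
        out.append((element.tag, text))
    # Recurse into children, handing each one its parent.
    for child in element:
        walk(child, element, out)
    # Text after the closing tag sits inside the parent.
    tail = (element.tail or "").strip()
    if tail and parent is not None:
        out.append((parent.tag, tail))
    return out

pairs = walk(ET.fromstring(snippet))
print(pairs)
```

For the sample snippet this produces the same three tag/text pairs as the lxml version, as tuples rather than dicts.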

I have now also managed to get it working with BeautifulSoup:

from bs4 import BeautifulSoup
from bs4.element import Tag

def sanitize(text):
    # collapse newlines and runs of whitespace into single spaces
    return ' '.join(text.split())

def parse(element):
    # yield (text, parent tag name) pairs in document order
    for content in element.contents:
        if isinstance(content, Tag):
            # recurse so nested text is paired with its own parent
            yield from parse(content)
        else:
            text = sanitize(content)
            if text:
                yield text, element.name

html = """
<div>
    text here
    <h1>some header</h1>
    more text here
</div>
"""

soup = BeautifulSoup(html, features='html.parser')
print(list(parse(soup)))

This gives:

[('text here', 'div'), ('some header', 'h1'), ('more text here', 'div')]
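
If installing a third-party parser is not an option, a similar result can probably be obtained with the standard library's html.parser by keeping a stack of open tags. This is a minimal sketch rather than anything from the answers above: the class name `TextWithTagParser` and the `VOID_TAGS` set are introduced here for illustration, and the handling of stray or unclosed tags is deliberately simplistic.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TextWithTagParser(HTMLParser):
    """Record each run of text together with the tag that directly contains it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.results = []  # collected {'tag': ..., 'text': ...} dicts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self.stack:
            self.results.append({"tag": self.stack[-1], "text": text})

sample = "<div>\n    some text\n    <h1>some header</h1>\n    some more text\n</div>\n"
parser = TextWithTagParser()
parser.feed(sample)
parser.close()
print(parser.results)
```

Text that appears outside any element is dropped in this sketch, since there is no parent tag to report for it.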