Extracting Messy, Untagged HTML Text Using Beautiful Soup in Python

Question

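I'm trying to use BeautifulSoup to parse a web page that contains a lot of untagged text. As the example below shows, the pattern is a name in a <b> tag followed by a series of untagged text lines with line breaks in between. At the end of each "group" of text there is an <hr> tag that marks the start of the next section.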

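For now I'd like to save this information to a CSV file. My current idea is to use soup.find_all("b") to get all the names. For each name retrieved, I would loop through its siblings manually with something like next_sibling, adding each line of text to my CSV file and ignoring the line breaks. On reaching the <hr> element, I would move on to the next "name" from the soup.find_all("b") results and advance the CSV to the next row.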

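I'm not sure this approach will actually work. First, I haven't figured out how to select each individual line of untagged text. The examples I've found all select every piece of untagged text on the page at once, which doesn't help me much. Second, I'm not sure my proposed way of "navigating" the page content is logically sound. In the experiments I've done, trying to get the next_sibling of the elements returned by soup.find_all("b") returns None. I haven't figured that one out yet.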

Admittedly, I don't have much experience with Beautiful Soup, and it has been a while since I last worked with HTML. Looking forward to learning more!

<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    Even More Text About Thing 1
    <br>
    Even MORE Text About Thing 1
    <br>
    <hr>
    <b>Thing 2</b>
    <br>
    Text About Thing 2
    <br>
    More Text About Thing 2
    <br>
    Even More Text About Thing 2
    <br>
    Even MORE Text About Thing 2
    <br>
    <hr>
    <b>Thing 3</b>
    <br>
    Text About Thing 3
    <br>
    More Text About Thing 3
    <br>
    Even More Text About Thing 3
    <br>
    Even MORE Text About Thing 3
    <br>
    <hr>
</div>
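
For readers who hit the same next_sibling confusion, here is a minimal sketch, assuming the built-in "html.parser" and a shortened copy of the sample markup. It shows that the node directly after a <b> tag is a whitespace string rather than a tag, and that untagged lines can be picked out one at a time by filtering next_siblings. In general, next_sibling is only None when nothing follows the node under the same parent. The names html, first_b, and lines are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened copy of the sample markup from the question.
html = """
<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    <hr>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
first_b = soup.find("b")

# The node right after </b> is the whitespace before <br>, not a tag.
print(repr(first_b.next_sibling))

# Walk every following sibling and keep only the non-empty text nodes.
lines = [
    node.strip()
    for node in first_b.next_siblings
    if isinstance(node, NavigableString) and node.strip()
]
print(lines)  # ['Text About Thing 1', 'More Text About Thing 1']
```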

Edit: The desired output looks like this:

Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3

Also, there is one condition I left out of the example. Some "Thing" sections actually look like this:

<div class="maincontent">
    ...
    <b>Thing 4</b>
    <br>
    Text About Thing 4
    <br>
     Text about 
     <a href="www.example.com">
       Thing 4
     </a>
     with a link in the middle.
    <br>
    Even More Text About Thing 4
    <br>
    Even MORE Text About Thing 4
    <br>
    <hr>
    ...
</div>

Ideally, the fragments around the link would be merged into a single sentence, producing the following output.

Thing 4,Text About Thing 4,Text about Thing 4 with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4

Instead, using the approach HedgeHog recommended, my output currently looks like this.

Thing 4,Text About Thing 4,Text about,Thing 4,with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4

Edit 2:

Here is my current solution, based mostly on HedgeHog's answer posted below.

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
maincontent = soup.select_one(".maincontent")

with open('myfile.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    # Replace each <a> with its children so link text becomes plain text nodes
    for a in maincontent.find_all('a'):
        a.unwrap()

    for b in maincontent.select('b'):
        d = [b.text]
        is_new_element = True  # True when the next text node starts a new field
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if is_new_element:
                is_new_element = False
                # strip must be called; comparing the bare method is always True
                if not t.name and t.strip() != '':
                    d.append(t.strip())
            else:
                if not t.name and t.strip() != '':
                    # Continuation of the same line (text around a link);
                    # the raw whitespace of t is kept here
                    d[-1] = d[-1] + t
                else:
                    is_new_element = True
        writer.writerow(d)

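The only remaining problem is making sure the correct whitespace is kept before and after each URL. Everything else I need to do involves reading each string and parsing out certain information, so I should be able to take it from here. Thanks, everyone!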
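
One possible way to handle the remaining whitespace issue, sketched under the assumption that single spaces between words are what is wanted: collapse every run of whitespace when fragments are merged instead of concatenating the raw text nodes. The helper merge_fragments is a hypothetical name, and the fragments list imitates the text nodes around the link in the "Thing 4" example.

```python
def merge_fragments(fragments):
    """Join text fragments into one sentence separated by single spaces."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty pieces.
    words = " ".join(fragments).split()
    return " ".join(words)


# Fragments shaped like the text nodes around the <a> in "Thing 4".
fragments = [
    "\n     Text about \n     ",
    "\n       Thing 4\n     ",
    "\n     with a link in the middle.\n    ",
]
print(merge_fragments(fragments))
# Text about Thing 4 with a link in the middle.
```

In the loop above, this would replace the plain concatenation, for example d[-1] = merge_fragments([d[-1], t]).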

Answer 1

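The approach you describe sounds reasonable, and from my point of view you are almost there. The expected output in the question isn't entirely clear, so this only points in a direction: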

from bs4 import BeautifulSoup
import csv

# `html` is assumed to hold the markup from the question
soup = BeautifulSoup(html, 'html.parser')
with open('myfile.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    for b in soup.select('b'):
        d = [b.text]
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if not t.name and t.strip() != '':
                d.append(t.strip())
        writer.writerow(d)

Output

Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3
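
For the "Thing 4" case added in the question's edit, a variation on the same loop might look like the sketch below. It is not part of the original answer. It assumes that only <br> and <hr> end a line of text and that any other tag (such as <a>) is inline and belongs to the current line. The names LINE_BREAKS, collapse, and rows_from are hypothetical, and io.StringIO stands in for a real file.

```python
import csv
import io

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div class="maincontent">
    <b>Thing 4</b>
    <br>
    Text About Thing 4
    <br>
     Text about
     <a href="www.example.com">
       Thing 4
     </a>
     with a link in the middle.
    <br>
    Even More Text About Thing 4
    <br>
    <hr>
</div>
"""

LINE_BREAKS = {"br", "hr"}  # assumption: tags that end one line of text


def collapse(parts):
    """Join fragments and squeeze every whitespace run to one space."""
    return " ".join(" ".join(parts).split())


def rows_from(soup):
    """Yield one list per <b>: the name followed by its text lines."""
    for b in soup.select("b"):
        row, current = [b.get_text(strip=True)], []
        for node in b.next_siblings:
            if isinstance(node, Tag) and node.name == "b":
                break  # the next name starts a new row
            if isinstance(node, Tag) and node.name in LINE_BREAKS:
                # Flush the fragments collected since the last break.
                if collapse(current):
                    row.append(collapse(current))
                current = []
            elif isinstance(node, Tag):
                current.append(node.get_text())  # inline tag such as <a>
            else:
                current.append(str(node))  # plain text node
        if collapse(current):
            row.append(collapse(current))
        yield row


soup = BeautifulSoup(html, "html.parser")
buffer = io.StringIO()
csv.writer(buffer).writerows(rows_from(soup))
print(buffer.getvalue())
```

Collecting fragments until the next <br> or <hr> keeps the link text in the same CSV field as the words around it, without modifying the tree first.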

Answer 2

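Another version: you can replace every <hr> in the main section with a separator of your choice and then use itertools.groupby to get the separate blocks of text, for example: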

from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser") # <-- html_doc is your HTML from the question

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for is_separator, g in groupby(text.splitlines(), lambda k: k == "-" * 80):
    if not is_separator:
        print(" ".join(g))  # <-- or store it to a file instead of printing to the screen

Prints:

Thing 1 Text About Thing 1 More Text About Thing 1 Even More Text About Thing 1 Even MORE Text About Thing 1
Thing 2 Text About Thing 2 More Text About Thing 2 Even More Text About Thing 2 Even MORE Text About Thing 2
Thing 3 Text About Thing 3 More Text About Thing 3 Even More Text About Thing 3 Even MORE Text About Thing 3
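
To store each group as a CSV row instead of printing it, the groupby loop could be adapted roughly as follows. This is a sketch only: the text value below is a hand-written stand-in for what get_text() returns in the snippet above (shortened to two groups), and io.StringIO stands in for a real file.

```python
import csv
import io
from itertools import groupby

SEPARATOR = "-" * 80

# Stand-in for the `text` value produced by get_text() above
# (an assumption for illustration, shortened to two groups).
text = "\n".join([
    "Thing 1", "Text About Thing 1", "More Text About Thing 1", SEPARATOR,
    "Thing 2", "Text About Thing 2", "More Text About Thing 2", SEPARATOR,
])

buffer = io.StringIO()  # swap for open("myfile.csv", "w", newline="")
writer = csv.writer(buffer)

for is_separator, group in groupby(text.splitlines(), lambda k: k == SEPARATOR):
    if not is_separator:
        writer.writerow(list(group))  # one line of text per CSV column

print(buffer.getvalue())
```

Note that get_text(separator="\n") also puts link text on its own line, so sections containing an <a> would still be split into extra columns with this approach.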

Or simply use plain str.split:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc is the HTML from the question

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for group in map(str.strip, text.split("-" * 80)):
    if group:
        print(group)
        print()

Prints 3 blocks:

Thing 1
Text About Thing 1
More Text About Thing 1
Even More Text About Thing 1
Even MORE Text About Thing 1

Thing 2
Text About Thing 2
More Text About Thing 2
Even More Text About Thing 2
Even MORE Text About Thing 2

Thing 3
Text About Thing 3
More Text About Thing 3
Even More Text About Thing 3
Even MORE Text About Thing 3

html

python

beautifulsoup

html-parsing