Extracting Messy, Untagged HTML text using Beautiful Soup in Python
I am trying to use BeautifulSoup to parse a web page containing a bunch of untagged text. As shown in the example below, the pattern is a name in a bold (<b>) tag, followed by a series of untagged lines of text with line breaks in between. At the end of each "group" of text there is an <hr> tag marking the start of the next section.
For now I'd like to save this information into a csv file. My current thinking is to use soup.find_all("b") to grab all of the names. For each name retrieved, I would manually loop over the siblings with something like next_sibling, adding the lines of text to my csv file and ignoring the line breaks. Once I reach the <hr> element, I would move on to the next "name" from the soup.find_all("b") results and advance the csv to the next row.
I'm not sure this line of thinking will actually translate into success. For one, I haven't figured out how to select each individual line of untagged text. The various examples I've been able to find involve selecting all of the untagged text on a page at once, which doesn't help me much. The other problem is that I'm not sure the way I propose to "navigate" the page content is logically sound. In the experimenting I've done, trying to get the next_sibling of an element returned by soup.find_all("b") returns None. Haven't figured that one out yet.
Admittedly, I don't have much experience with Beautiful Soup, and I haven't worked with HTML in a while. Looking forward to learning more about it!
<div class="maincontent">
<b>Thing 1</b>
<br>
Text About Thing 1
<br>
More Text About Thing 1
<br>
Even More Text About Thing 1
<br>
Even MORE Text About Thing 1
<br>
<hr>
<b>Thing 2</b>
<br>
Text About Thing 2
<br>
More Text About Thing 2
<br>
Even More Text About Thing 2
<br>
Even MORE Text About Thing 2
<br>
<hr>
<b>Thing 3</b>
<br>
Text About Thing 3
<br>
More Text About Thing 3
<br>
Even More Text About Thing 3
<br>
Even MORE Text About Thing 3
<br>
<hr>
</div>
Edit: The desired output would look like this:
Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3
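For reference, a rough, untested sketch of the approach described above (html here is just a placeholder for the page markup):

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html is a placeholder for the markup above

with open("things.csv", "w", newline="") as f:  # file name is just an example
    writer = csv.writer(f)
    for b in soup.select(".maincontent b"):
        row = [b.get_text(strip=True)]
        for sib in b.next_siblings:
            if sib.name == "hr":  # the <hr> closes the current group
                break
            # text siblings are NavigableStrings, and many are whitespace-only,
            # which can look like "nothing there" when stepping with next_sibling
            if not sib.name and sib.strip():
                row.append(sib.strip())
        writer.writerow(row)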
Also, there is one condition I left out of the example. Some of the "Thing" sections actually look like this:
<div class="maincontent">
...
<b>Thing 4</b>
<br>
Text About Thing 4
<br>
Text about
<a href="www.example.com">
Thing 4
</a>
with a link in the middle.
<br>
Even More Text About Thing 4
<br>
Even MORE Text About Thing 4
<br>
<hr>
...
</div>
Ideally, the sentence surrounding the link would be collapsed into a single sentence, producing the following output.
Thing4,Text About Thing 4,Text about Thing 4 with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4
Instead, using the approach HedgeHog recommended, my output currently looks like this.
Thing4,Text About Thing 4,Text about,Thing 4,with a link in the middle,Even More Text About Thing 4,Even MORE Text About Thing 4
Edit 2:
Here is my current solution, based largely on what HedgeHog posted below.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
maincontent = soup.select_one(".maincontent")

with open('myfile.csv', 'w') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    # strip the <a> tags but keep their text in place
    for a in maincontent.findAll('a'):
        a.replaceWithChildren()

    for b in maincontent.select('b'):
        d = [b.text]
        isNewElement = True
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if isNewElement:
                isNewElement = False
                if not t.name and t.strip() != '':
                    d.append(t.strip())
            else:
                # continuation of the previous line, e.g. text split apart by an unwrapped link
                if not t.name and t.strip() != '':
                    d[-1] = d[-1] + t
                else:
                    isNewElement = True
        writer.writerow(d)
The only remaining problem is making sure the correct whitespace is kept before and after each URL. Everything else I need to do involves reading each string and parsing out certain information, so I should be able to take it from here. Thanks, everyone!
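One possible way to handle that spacing, sketched here as an untested assumption: instead of unwrapping the links, replace each <a> with its text padded by a single space on each side, merge the adjacent text pieces with smooth() (available in bs4 4.8.0+), and collapse leftover whitespace when building each cell:

for a in maincontent.find_all('a'):
    # keep the link text, with a guaranteed single space on either side
    a.replace_with(" " + a.get_text(strip=True) + " ")

maincontent.smooth()  # bs4 >= 4.8.0: merge the now-adjacent text nodes into single strings

# when building a cell from a text node t, collapsing runs of whitespace
# keeps stray newlines out of the CSV:
# cell = " ".join(t.split())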
The path you describe sounds plausible, and from my point of view you are nearly there. What causes issues is that the expected output is unclear, so this just points in one direction:
from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(html)

with open('myfile.csv', 'w') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')

    for b in soup.select('b'):
        d = [b.text]
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if not t.name and t.strip() != '':
                d.append(t.strip())
        writer.writerow(d)
Output
Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3
Another version: you could replace every <hr> in the main content with a separator of your choice and then use itertools.groupby to get the separate blocks of text, for example:
from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your HTML from the question
maincontent = soup.select_one(".maincontent")

for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for is_separator, g in groupby(text.splitlines(), lambda k: k == "-" * 80):
    if not is_separator:
        print(" ".join(g))  # <-- or store it to file instead printing to screen
Prints:
Thing 1 Text About Thing 1 More Text About Thing 1 Even More Text About Thing 1 Even MORE Text About Thing 1
Thing 2 Text About Thing 2 More Text About Thing 2 Even More Text About Thing 2 Even MORE Text About Thing 2
Thing 3 Text About Thing 3 More Text About Thing 3 Even More Text About Thing 3 Even MORE Text About Thing 3
Or use plain str.split directly:
soup = BeautifulSoup(html_doc, "html.parser")
maincontent = soup.select_one(".maincontent")

for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for group in map(str.strip, text.split("-" * 80)):
    if group:
        print(group)
        print()
Prints 3 blocks:
Thing 1
Text About Thing 1
More Text About Thing 1
Even More Text About Thing 1
Even MORE Text About Thing 1

Thing 2
Text About Thing 2
More Text About Thing 2
Even More Text About Thing 2
Even MORE Text About Thing 2

Thing 3
Text About Thing 3
More Text About Thing 3
Even More Text About Thing 3
Even MORE Text About Thing 3
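If CSV rows are wanted instead of printed blocks, the groupby variant above could feed csv.writer directly; a minimal, untested sketch (myfile.csv is just an example name, and text is the string built in the snippet above):

import csv
from itertools import groupby

with open("myfile.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for is_separator, g in groupby(text.splitlines(), lambda k: k == "-" * 80):
        if not is_separator:
            writer.writerow(list(g))  # one block of lines -> one CSV row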