Web scraping with Beautifulsoup - output unintentionally merging words (e.g., ThisHappens)

Question

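I am trying to web scrape some research abstracts, but some words are being merged together. Unfortunately, the merging is not consistent enough for me to just do something like outputexample.replace("WordMerge","").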

For example, for the URL provided in my code, the first line of the output is:

AbstractsPublic AbstractDownload this abstract: English (pdf) | Español (pdf) | Audio Recording (mp3)

I would like to avoid this and preserve the original text and formatting as much as possible.

 import requests
 import time
 from bs4 import BeautifulSoup
 import re

 urlsummary = 'https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients'
 html = requests.get(urlsummary).content
 soup = BeautifulSoup(html, 'lxml')

 abstract = soup.find(class_='pane pane--node').get_text()
 print(abstract)

Answer 1

Just use

.get_text(" ")

From the docs:

You can specify a string to be used to join the bits of text together:
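To illustrate the difference, here is a minimal sketch that runs without a network request. The HTML snippet is hypothetical, only loosely modeled on the merged output shown in the question, and it uses the built-in `html.parser` instead of `lxml` to avoid an extra dependency. The `strip=True` argument is an optional addition that trims whitespace from each piece of text before joining.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (not the real page): adjacent tags with no
# whitespace between them, which is what causes the merged words.
html = """
<div class="pane pane--node">
  <h2>Abstracts</h2><h3>Public Abstract</h3><p>Download this abstract: <a href="#">English (pdf)</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
node = soup.find(class_="pane pane--node")

# Default: text pieces are joined with an empty string, so words merge.
merged = node.get_text()

# With a separator: each piece of text is joined with a single space.
spaced = node.get_text(" ", strip=True)

print(repr(merged))
print(repr(spaced))
```

One caveat: the separator is inserted between every piece of text, including around inline tags such as links or bold text, so punctuation that directly follows an inline element may end up with a space in front of it.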
