BeautifulSoup joins words in different paragraphs

Question

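I have an EPUB file that I need to work with. I am trying to extract the text from the HTML files inside it. When I run soup.get_text() on the HTML content I extracted, all the paragraphs are joined together, so the last word of one paragraph runs into the first word of the next.

I tried replacing all <br> and </br> tags with spaces. I also tried changing the parser from html.parser to html5lib.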

with self._epub.open(html_file) as chapter:
    html_content = chapter.read().decode('utf-8')
    html_content = html_content.replace('</br>', ' ')
    html_content = html_content.replace('<br>', ' ')
    soup = bs4.BeautifulSoup(html_content, features="html5lib")
    clean_content = soup.get_text()

Input HTML:

Paragraph1. Line1

Line2

Expected output:

Paragraph1. Line1 Line2

Actual output:

Paragraph1. Line1Line2

Answer 1

Once you have the HTML, you can do it like this: loop over the paragraphs and join their text with a space.

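from bs4 import BeautifulSoup

html = '''<p>Paragraph1. Line 1</p><p>Line 2</p>'''

# Parse the markup, then append each paragraph's text followed by a space
soup = BeautifulSoup(html, 'html.parser')
itemtext = ''
for item in soup.select('p'):
    itemtext += item.text + ' '

# Drop the trailing space left by the last paragraph
print(itemtext.strip())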

Output:

Paragraph1. Line 1 Line 2
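As an alternative sketch, get_text() itself accepts a separator argument, so the loop may not be needed. The snippet below assumes the chapter markup looks like the sample in the question (two adjacent <p> elements); the variable names are illustrative only and do not come from the original code.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the input shown in the question
html = "<p>Paragraph1. Line1</p><p>Line2</p>"
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: text nodes are concatenated with no separator
joined = soup.get_text()

# With a separator, each text node is joined by a space;
# strip=True trims whitespace around each node and skips empty ones
spaced = soup.get_text(separator=" ", strip=True)

print(joined)  # Paragraph1. Line1Line2
print(spaced)  # Paragraph1. Line1 Line2
```

Note that the separator is inserted between every text node, not only between paragraphs, so inline tags such as <b> or <i> in the middle of a word would also pick up a space. If that matters for your EPUB, the per-paragraph loop above gives finer control.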

beautifulsoup

epub

python-3.x