遍历 html 中的所有元素并将内容替换为 Beautifulsoup

Question

在我的数据库中，我存储 HTML 来自自定义 CMS 的所见即所得编辑器。内容是英文的，我想使用 Beautifulsoup 遍历每个元素，将其内容翻译成德语（使用另一个 class，翻译器）并将当前元素的值替换为翻译文本。

到目前为止，我已经能够结合 Beautifulsoup 的 .findAll 函数为 p、a、pre 提出特定的 select 或我不清楚如何简单地遍历所有元素并即时替换它们的内容，而不是必须根据特定类型进行过滤。

编辑器生成的一个非常基本的 HTML 示例，涵盖所有不同类型：

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>

bs4 文档向我指出了一个 replace_with 函数，如果我只能 select 每个元素一个接一个，而不必专门 select 某些东西，这将是理想的。

欢迎指点

Answer 1

您基本上可以这样做来遍历每个元素：

html="""
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
"""

from bs4 import BeautifulSoup

soup=BeautifulSoup(html,"lxml")
for x in soup.findAll():
    print(x.text)
    # You can try this as well
    print(x.find(text=True,recursive=False))
    # I think this will return result as you expect.

输出：

Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text 

Quote
text after quote


code

text after code

This is a search engine



Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text 

Quote
text after quote


code

text after code

This is a search engine



Heading 1
Heading 2
Heading 3
Normal text
Bold text
Bold text
Italic text 
Italic text 


Quote
text after quote




code


text after code


This is a search engine
This is a search engine

而且我相信你有翻译功能，你也知道如何替换它。

Answer 2

这里有一个关于如何使用 BeautifulSoup 替换字符串的小示例代码。在您的情况下，您需要一个初步步骤，获取语言之间的映射，字典可能就是这种情况。

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml') # or use any other parser

new_string = 'xxx' # replace each string with the same value
_ = [s.replace_with(new_string) for s in soup.find_all(string=True)]

print(soup.prettify())

遍历 html 中的所有元素并将内容替换为 Beautifulsoup

Iterate over all elements in html and replace content with Beautifulsoup

html

python

beautifulsoup

html-parsing