Remove all the <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re

Question

I am trying to remove the <u> and <a> tags from all DIV tags with class "sf-item" in an HTML source, because they break up the text when I scrape it from a web URL.

(For this demo I have passed a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)

So far I have tried using re with the line below, but I am not sure how to specify a condition in re so that it only removes the substring between <u and /u> inside tags of class sf-item.

data = re.sub('<u.*?u>', '', data)

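I also tried the lines below to remove all <u> and <a> tags from the whole source, but somehow that does not work. I am not sure how to target all <u> and <a> tags only inside DIV tags with class sf-item.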

for tag in soup.find_all('u'):
    tag.replaceWith('')

I would appreciate any help in achieving this.

Below is the sample Python code, which runs as-is:

from re import sub
from bs4 import BeautifulSoup
import re

data = """
<div class="sf-item"> The rabbit got to the halfway point at   
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
</div>
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

# data = re.sub('<u.*?u>', '', data)  ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item

soup = BeautifulSoup(data, "html.parser")

for tag in soup.find_all('u'):
    tag.replaceWith('')

fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})

for result in rMessage:
    fResult.append(sub("&ldquo;|.&rdquo;","","".join(result.contents[0:1]).strip()))

fResult = list(filter(None, fResult))
print(fResult)

The output I get from the code above is:

['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']

But I need the output to be as follows:

['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']

Answer 1

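BeautifulSoup has a built-in way to get the visible text from a tag (that is, the text that would be displayed when rendered in a browser). Running the code below, I get your expected output: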

from re import sub
from bs4 import BeautifulSoup
import re

data = """
<div class="sf-item"> The rabbit got to the halfway point at   
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
</div>
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

soup = BeautifulSoup(data, "html.parser")

rMessage=soup.findAll("div",{'class':"sf-item"})

fResult = []

for result in rMessage:
    fResult.append(result.text.replace('\n', ''))

This gives you the right output, but with some extra spaces. If you want to collapse them all into single spaces, you can run fResult through this:

fResult = [re.sub(' +', ' ', result) for result in fResult]
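If you do want to remove the <u> and <a> tags themselves, and only inside DIVs of class sf-item, here is a minimal sketch of one way to do it. It assumes a BeautifulSoup version that supports CSS selectors through select(), and it uses unwrap(), which drops a tag but keeps its contents. The helper names unwrap_in_sf_items and sf_item_texts are made up for illustration, and the HTML is a trimmed copy of the sample from the question without the outer "sf" wrapper. Note that the class filter also matches the "sf-item sf-icon" DIV, which contains no text, so the sketch skips empty results.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question (assumption: the outer
# "sf" wrapper div is left out for brevity).
html = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""


def unwrap_in_sf_items(soup):
    """Hypothetical helper: drop <u> and <a> tags inside div.sf-item, keep their text."""
    # The selector limits the match to descendants of div.sf-item, so <u> and
    # <a> tags elsewhere in the document are left alone.
    for tag in soup.select("div.sf-item u, div.sf-item a"):
        tag.unwrap()


def sf_item_texts(soup):
    """Hypothetical helper: whitespace-normalized text of every div.sf-item."""
    texts = []
    for div in soup.select("div.sf-item"):
        # split() with no arguments splits on any whitespace run, so joining
        # with a single space collapses newlines and repeated spaces.
        text = " ".join(div.get_text().split())
        if text:  # skip DIVs with no visible text, such as the icon-only one
            texts.append(text)
    return texts


soup = BeautifulSoup(html, "html.parser")
unwrap_in_sf_items(soup)
result = sf_item_texts(soup)
print(result)
```

The unwrap step is only needed if you want the modified markup afterwards. For the text alone, get_text() already ignores the tags, so sf_item_texts can be called on the untouched soup.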

