Remove all the <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re
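I am trying to remove the <u> and <a> tags from all DIV tags with class "sf-item" in an HTML source, because they break up the text when I scrape it from a web URL.
(For this demonstration I have assigned a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)
So far I have tried re with the line below, but I am not sure how to express the condition in re: remove the substrings between <u and /u> only within DIV tags of class sf-item.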
data = re.sub('<u.*?u>', '', data)
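I also tried the lines below to remove all <u> and <a> tags from the entire source, but somehow it does not work. I am not sure how to target only the <u> and <a> tags inside DIV tags with class sf-item.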
for tag in soup.find_all('u'):
    tag.replaceWith('')
I would appreciate it if you could help me achieve this.
Below is the working sample Python code:
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""
# data = re.sub('<u.*?u>', '', data) ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all('u'):
    tag.replaceWith('')
fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})
for result in rMessage:
    fResult.append(sub("“|.”","","".join(result.contents[0:1]).strip()))
fResult = list(filter(None, fResult))
print(fResult)
The output I get from the code above is:
['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']
But I need the output to be as follows:
['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']
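The truncated output is probably explained by `result.contents[0:1]`, which keeps only the first direct child of each div, that is, the text before the first <u> tag. The sketch below uses a shortened, made-up HTML string (not the real source) to show the difference between that slice and the text of the whole div:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup used only for illustration.
html = '<div class="sf-item"> before <u><a href="#">link</a></u> after </div>'
div = BeautifulSoup(html, "html.parser").find("div", class_="sf-item")

# The div has three direct children: a text node, the <u> tag, another text node.
children = div.contents

# Slicing [0:1] keeps only the first text node, so everything after <u> is lost.
first_only = "".join(children[0:1]).strip()

# get_text() walks every descendant and drops the tags but keeps their text.
all_text = div.get_text()

print(first_only)
print(all_text.split())
```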
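BeautifulSoup has a built-in method for getting the visible text from a tag (i.e. the text that would be displayed when rendered in a browser). Running the code below, I get your expected output: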
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""
soup = BeautifulSoup(data, "html.parser")
rMessage=soup.findAll("div",{'class':"sf-item"})
fResult = []
for result in rMessage:
    fResult.append(result.text.replace('\n', ''))
This gives you the correct output, but with some extra whitespace. If you want to reduce every run of spaces to a single space, you can run fResult through this:
fResult = [re.sub(' +', ' ', result) for result in fResult]
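If you do want to strip the <u> and <a> tags themselves, and only inside divs of class sf-item, one possible approach is `Tag.unwrap()`, which removes a tag but leaves its contents in place. The following is a minimal sketch rather than a definitive solution: the helper name `sf_item_texts` and the shortened HTML are made up for illustration, and it assumes that divs with no visible text (such as the icon div) should be skipped.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample based on the question's HTML (assumption: well-formed divs).
html = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf-item sf-icon"><span class="supporticon is"></span></div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
"""

def sf_item_texts(markup):
    """Hypothetical helper: return the cleaned text of each div.sf-item."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for div in soup.find_all("div", class_="sf-item"):
        # Drop the <u>/<a> wrappers inside this div only, keeping their text.
        for tag in div.find_all(["u", "a"]):
            tag.unwrap()
        # Collapse every run of whitespace (newlines included) to one space.
        text = re.sub(r"\s+", " ", div.get_text()).strip()
        if text:  # skip divs without visible text, e.g. the icon div
            texts.append(text)
    return texts

cleaned = sf_item_texts(html)
print(cleaned)
```

Note that `get_text()` already ignores tags, so the `unwrap()` step only matters if you go on to use the modified tree for something else.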