Extract text only, excluding the content of script tags, from HTML with BeautifulSoup

Question

I have HTML like this:

<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>

I am trying to extract "Ages 15" using BeautifulSoup, so I wrote the following Python code.

Code:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})

print(age.text)

Output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

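I only want "Ages 15", not the function call inside the script tag. Is there a way to get only the text "Ages 15", or to exclude the content of script tags in some other way?

PS: There are many script tags and many different URLs, so I would prefer not to clean up the output with string replacement.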

Answer 1

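Use .find(text=True), which returns the first text node inside the tag. (From Beautiful Soup 4.4.0 onward the same argument is also accepted as string=True; text is the older name.)

Example: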

from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
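Note that .find(text=True) stops at the first text node. If the wanted text might be split across several direct children of the span, a variation like the one below may help. It is only a sketch: direct_text is a hypothetical helper name, not part of BeautifulSoup, and it assumes a version of the library that accepts the string= argument.

```python
from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""


def direct_text(tag):
    """Join the text nodes that are immediate children of `tag`.

    Hypothetical helper for illustration: text inside nested tags
    (including <script>) is ignored because recursive=False limits
    the search to direct children.
    """
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes and trim the rest.
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
print(direct_text(age))  # Ages 15
```

The trade-off is that text inside nested, non-script tags is skipped as well, so this only fits pages where the wanted text sits directly inside the matched element.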

Answer 2

A late answer, but for future reference: you can also use decompose() to remove all script elements from the HTML, i.e.:

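from bs4 import BeautifulSoup

# html is the same string as in the question
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements from the tree (this modifies soup in place)
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15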
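Since decompose() changes the parsed tree, it may not suit code that still needs the script elements later. A possible non-destructive alternative is to walk the text nodes and skip those whose parent is a script or style tag. This is a sketch under assumptions: visible_text and its skip parameter are hypothetical names chosen for illustration, not BeautifulSoup API.

```python
from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""


def visible_text(tag, skip=("script", "style")):
    """Collect text under `tag`, ignoring text whose parent tag is in `skip`.

    Hypothetical helper for illustration; the tree itself is left untouched.
    """
    pieces = []
    for node in tag.find_all(string=True):
        if node.parent.name in skip:
            continue  # text that belongs to a <script> or <style> element
        stripped = node.strip()
        if stripped:  # ignore whitespace-only nodes
            pieces.append(stripped)
    return " ".join(pieces)


soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
print(visible_text(age))  # Ages 15
print(soup.find("script") is not None)  # True, the script tag is still there
```

Unlike the direct-children approach, this keeps text from nested tags such as the inner span, so it is closer to what .text returns minus the script content.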


Tags: python, beautifulsoup, urllib3, python-3.x