Extract only the text, excluding the content of script tags, from HTML with BeautifulSoup
I have HTML like this:
<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>
I am trying to extract Ages 15 using BeautifulSoup, so I wrote the following Python code.
Code:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})
print(age.text)
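Output:
Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
I only want Ages 15, not the function call inside the script tag. Is there a way to get only the text Ages 15, or any way to exclude the content of script tags?
PS: There are many script tags and many different URLs, so I would rather not remove the unwanted text from the output with string replacement.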
Use .find(text=True), which returns the first text node inside the tag.
Example:
from bs4 import BeautifulSoup
html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())
Output:
Ages 15
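Note that .find(text=True) returns only the first text node, so it works here because Ages 15 comes before any child tag. If the wanted text could be split around child tags, a possible variant is to collect every direct child text node of the span. The sketch below assumes a Beautiful Soup version that accepts the string= argument (the newer name for text=); the helper name direct_text is made up for illustration.

```python
from bs4 import BeautifulSoup

HTML = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""


def direct_text(tag):
    """Join the text nodes that are direct children of `tag`.

    recursive=False limits the search to direct children, so text nested
    inside child tags (including <script>) is never visited.
    """
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return " ".join(p for p in pieces if p)


soup = BeautifulSoup(HTML, "html.parser")
age = soup.find("span", {"class": "age"})
result = direct_text(age)
print(result)
# Ages 15
```

The trade-off is that text inside legitimate nested tags (for example a <b> inside the span) is skipped as well, so this only fits when the wanted text sits directly in the outer tag.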
A late answer, but for future reference, you can also use decompose() to remove all script elements from the HTML, i.e.:
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15
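decompose() modifies the parsed tree, so the script elements are gone for any later lookups. If the tree needs to stay intact, one possible alternative is to walk the text nodes and skip those whose parent is a script or style tag. This is only a sketch: the helper name visible_text and its skip parameter are made up for illustration, and it assumes a Beautiful Soup version that accepts the string= argument.

```python
from bs4 import BeautifulSoup

HTML = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""


def visible_text(tag, skip=("script", "style")):
    """Return the text under `tag`, ignoring text whose parent is in `skip`."""
    parts = []
    for node in tag.find_all(string=True):
        # Each text node knows its enclosing tag via .parent
        if node.parent.name in skip:
            continue
        stripped = node.strip()
        if stripped:
            parts.append(stripped)
    return " ".join(parts)


soup = BeautifulSoup(HTML, "html.parser")
age = soup.find("span", {"class": "age"})
result = visible_text(age)
print(result)
# Ages 15

# The tree is untouched: the <script> element is still present.
script_still_there = soup.find("script") is not None
```

Unlike the decompose() approach, this leaves soup unchanged, which matters if the script contents (for example the coordinates passed to getCurrentLocationVal) are needed later.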