How to find tags inside a class with Beautiful Soup
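I am trying to find all <p> tags inside the class content-inner, but I don't want the <p> tag about copyright (the last <p> tag, which sits outside the container class) to show up when I filter the <p> tags. My images also come back as an empty list or nothing at all, so no pictures are saved.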
import requests
from bs4 import BeautifulSoup

main = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(main.content, 'html5lib')
news = beautify.find_all('div', {'class', 'jeg_block_container'})
arti = []
for each in news:
    title = each.find('h3', {'class', 'jeg_post_title'}).text
    lnk = each.a.get('href')
    # Fetch and parse the linked article page
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = [i.text.strip() for i in soup.find_all('p')]
    content = ' '.join(content)
    images = [i['src'] for i in soup.find_all('img')]
    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': images,
        'content': content
    })
The site's HTML looks like this:
<html><head><title>The simple's story</title></head>
<body>
<div class="content-inner "><div class="addtoany_share_save_cont"><p>He added: “The President king administration has embarked on
railway construction</p>
<p>Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<p> we will not once in Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p>the emergency of our matter is Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<br></br>
<script></script>
<p>king of our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<img src="image.png">
<p>he is our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p>some weas Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
</div>
</div>
<div>
<p>Copyright © 2021. All Rights Reserved. Vintage Press Limited. Optimized by <a href="https://inerd360.com/">iNERD360</a></p>
</div>
This returns an empty list:
content = [i.text.strip() for i in soup.find_all('div', {'class', 'content-inner'})]
For images, this code also returns nothing:
images = [i['src'] for i in soup.find_all('img',)]
This picks up every <p> tag on the HTML page, which is not what I want:
content = [i.text.strip() for i in soup.find_all('p')]
How do I select all <p> tags except the last one, which is outside the class? Also, how do I filter the images correctly with bs4?
Get a list of all paragraphs:
paragraphs = soup.find_all("p")
Build a filtered list (using a list comprehension) of the paragraphs that do not start with the string "Copyright":
paragraphs = [item.text.strip() for item in paragraphs if not item.text.startswith("Copyright")]
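As a minimal, self-contained sketch of this filter: the inline `SAMPLE_HTML` string below is a hypothetical stand-in for the downloaded page, and the built-in `html.parser` is used instead of `html5lib` only so the example has no extra dependency. The text is stripped before the `startswith` check so that leading whitespace cannot hide the prefix.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded article page.
SAMPLE_HTML = """
<div class="content-inner "><div class="addtoany_share_save_cont">
<p>First paragraph.</p>
<p> Second paragraph.</p>
</div></div>
<div><p>Copyright © 2021. All Rights Reserved.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Strip each paragraph's text first, then drop the ones starting with "Copyright".
texts = [p.get_text().strip() for p in soup.find_all("p")]
paragraphs = [text for text in texts if not text.startswith("Copyright")]

print(paragraphs)
```

Note that this approach depends on the wording of the footer: any other <p> outside the container would still be included.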
Replace:
content = [i.text.strip() for i in soup.find_all('p')]
with:
div_list = [div for div in soup.find_all('div', class_="content-inner")]
p_list = [div.find_all('p') for div in div_list]
content = [item.text.strip() for p in p_list for item in p]
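The rest of the code stays the same.
This way, your script returns a list with everything you asked for (images included), except for the ad and copyright strings.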
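The same scoping idea can also be written with CSS selectors, and applied to the images as well as the paragraphs. This is only a sketch under assumptions: `SAMPLE_HTML` is a hypothetical stand-in for the article page, `footer.png` is an invented image added to show that content outside the container is skipped, and `html.parser` replaces `html5lib` to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; footer.png is invented to show what gets excluded.
SAMPLE_HTML = """
<div class="content-inner "><div class="addtoany_share_save_cont">
<p>First paragraph.</p>
<script></script>
<img src="image.png">
<p>Second paragraph.</p>
</div></div>
<div><img src="footer.png"><p>Copyright © 2021. All Rights Reserved.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only <p> tags that are descendants of <div class="content-inner">.
content = " ".join(
    p.get_text(strip=True) for p in soup.select("div.content-inner p")
)

# Only <img> tags inside the same container that actually carry a src attribute,
# which avoids a KeyError on images without one.
images = [img["src"] for img in soup.select("div.content-inner img[src]")]

print(content)
print(images)
```

If the real page sets the image URL through JavaScript or a lazy-loading attribute rather than `src`, the `images` list would still come back empty; the sample HTML in the question is not enough to tell whether that is the case.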