How to find tags inside a class with Beautiful Soup

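I'm trying to find all the <p> tags inside the class content-inner, but I don't want the <p> tag that talks about copyright (the last <p> tag, which sits outside the container class) to show up when I filter the <p> tags. In addition, my image list comes back empty or shows nothing at all, so no image gets saved.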

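import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse it (the html5lib parser must be installed)
main = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(main.content,'html5lib')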

news = beautify.find_all('div', {'class','jeg_block_container'})
arti = []

for each in news:
    title = each.find('h3', {'class','jeg_post_title'}).text
    lnk = each.a.get('href')
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text,'html5lib')
    content = [i.text.strip() for i in soup.find_all('p')]
    content = ' '.join(content)
    images = [i['src'] for i in soup.find_all('img')]

    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': images,
        'content': content
    })

The site's HTML looks like this:

<html><head><title>The simple's story</title></head>
<body>
    <div class="content-inner "><div class="addtoany_share_save_cont"><p>He added: “The President king  administration has embarked on 
    railway construction</p>
    <p>Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
        <script></script>
    <p> we will not once in Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <p>the emergency of our matter is Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    
    <script></script>
    <br></br>
    <script></script>
    <p>king of our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <script></script>
    <img src="image.png">
    <p>he is our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <p>some weas Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
</div>
</div>
<div>
<p>Copyright © 2021. All Rights Reserved. Vintage Press Limited.  Optimized by <a href="https://inerd360.com/">iNERD360</a></p>
</div>

This shows an empty list:

content = [i.text.strip() for i in soup.find_all('div', {'class', 'content-inner'})]

For the images, this code also comes back empty:

images = [i['src'] for i in soup.find_all('img',)]

This picks up every <p> tag on the HTML page, which is not what I want:

content = [i.text.strip() for i in soup.find_all('p')]

How can I get all the <p> tags except the last one, which is outside the class? Also, how do I filter the images correctly using bs4?

Get a list of all the paragraphs:

paragraphs = soup.find_all("p")

Then build a filtered list (with a list comprehension) that leaves out any paragraph starting with the string "Copyright":

paragraphs = [item.text.strip() for item in paragraphs if not item.text.startswith("Copyright")]
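
A minimal, self-contained sketch of the same prefix filter, in case it helps to try it offline. The `html` string is a hypothetical stand-in trimmed from the markup in the question, and it uses the built-in `html.parser` rather than `html5lib` only so that nothing extra needs installing. Note that it strips each paragraph before testing the prefix, so leading whitespace inside a <p> cannot hide the match.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page, trimmed from the question's HTML.
html = """
<div class="content-inner ">
  <p>Once upon a time there were three little sisters.</p>
  <p>They lived at the bottom of a well.</p>
</div>
<div>
  <p> Copyright © 2021. All Rights Reserved.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Strip first, then test the prefix, so leading whitespace does not defeat it.
texts = [p.get_text(strip=True) for p in soup.find_all("p")]
paragraphs = [t for t in texts if not t.startswith("Copyright")]

print(paragraphs)
```

This approach depends on the footer wording, so it would need adjusting if the site changes the copyright line.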

Replace: content = [i.text.strip() for i in soup.find_all('p')]

with:

div_list = [div for div in soup.find_all('div', class_="content-inner")]
p_list = [div.find_all('p') for div in div_list]
content = [item.text.strip() for p in p_list for item in p]

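The rest of the code stays the same. This way, your script returns a list with everything you asked for (images included), apart from the ad and copyright strings.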
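
As a rough sketch of scoping both the paragraphs and the images to the container, assuming the article body always lives in a single div with the class content-inner: the `html` string and the `data-src` image below are hypothetical, added only to show that an <img> without a `src` attribute would make `i['src']` raise a KeyError. One more thing that may be worth checking in the original attempts: `{'class', 'content-inner'}` is a Python set, whereas attribute filters are normally written as a dict (`{'class': 'content-inner'}`) or with the `class_` keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one article page, based on the question's HTML.
html = """
<div class="content-inner "><div class="addtoany_share_save_cont">
  <p>First paragraph.</p>
  <script></script>
  <img src="image.png">
  <img data-src="lazy.png">
  <p>Second paragraph.</p>
</div></div>
<div>
  <p>Copyright © 2021. All Rights Reserved.</p>
  <img src="footer-logo.png">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches one CSS class, even though the attribute has a trailing space.
container = soup.find("div", class_="content-inner")

content, images = "", []
if container is not None:
    # Only <p> tags nested inside the container, so the footer is never seen.
    content = " ".join(p.get_text(strip=True) for p in container.find_all("p"))
    # Skip <img> tags that have no src attribute instead of raising KeyError.
    images = [img["src"] for img in container.find_all("img") if img.has_attr("src")]

print(content)
print(images)
```

If the real pages load their images lazily, the URL may sit in a different attribute, or be injected by JavaScript that requests never runs, so it is worth inspecting `r.text` directly before assuming the selector is at fault.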