How to find tags inside a class with Beautiful Soup

Question

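I am trying to find all <p> tags inside the class content-inner, but I do not want the <p> tag about copyright (the last <p> tag, which sits outside the container class) to appear when I filter the <p> tags. In addition, my image list comes back empty or returns nothing at all, so no images are saved.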

import requests
from bs4 import BeautifulSoup

main = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(main.content,'html5lib')

news = beautify.find_all('div', {'class','jeg_block_container'})
arti = []

for each in news:
    title = each.find('h3', {'class','jeg_post_title'}).text
    lnk = each.a.get('href')
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text,'html5lib')
    content = [i.text.strip() for i in soup.find_all('p')]
    content = ' '.join(content)
    images = [i['src'] for i in soup.find_all('img')]

    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': images,
        'content': content
    })

The site's HTML looks like this:

<html><head><title>The simple's story</title></head>
<body>
    <div class="content-inner "><div class="addtoany_share_save_cont"><p>He added: “The President king  administration has embarked on 
    railway construction</p>
    <p>Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
        <script></script>
    <p> we will not once in Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <p>the emergency of our matter is Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    
    <script></script>
    <br></br>
    <script></script>
    <p>king of our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <script></script>
    <img src="image.png">
    <p>he is our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
    <p>some weas Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
</div>
</div>
<div>
<p>Copyright © 2021. All Rights Reserved. Vintage Press Limited.  Optimized by <a href="https://inerd360.com/">iNERD360</a></p>
</div>

This returns an empty list:

content = [i.text.strip() for i in soup.find_all('div', {'class', 'content-inner'})]

For the images, this code also comes back empty:

images = [i['src'] for i in soup.find_all('img',)]

This picks up every <p> tag in the HTML page, which is not what I want:

content = [i.text.strip() for i in soup.find_all('p')]

How do I get all <p> tags except the last <p> tag, the one outside the class? Also, how do I filter the images correctly using bs4?

Answer 1

Get a list of all paragraphs:

paragraphs = soup.find_all("p")

Then build a filtered list (using a list comprehension) of the paragraphs that do not start with the string "Copyright":

paragraphs = [item.text.strip() for item in paragraphs if not item.text.startswith("Copyright")]
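To see the filter in action, here is a minimal sketch run against a trimmed-down copy of the markup from the question. The HTML string and the use of the built-in "html.parser" (instead of html5lib) are assumptions made so the snippet runs on its own. Note that it strips the text before testing the prefix, and that a text-based filter like this would also drop any article paragraph that happens to begin with "Copyright".

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumed markup).
html = """
<div class="content-inner ">
  <p>Once upon a time there were three little sisters.</p>
  <p>They lived at the bottom of a well.</p>
</div>
<div>
  <p>Copyright © 2021. All Rights Reserved.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Strip first, so leading whitespace cannot hide the "Copyright" prefix.
texts = [p.get_text().strip() for p in soup.find_all("p")]

# Keep only the paragraphs that do not start with "Copyright".
paragraphs = [t for t in texts if not t.startswith("Copyright")]

print(paragraphs)
```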

Answer 2

Replace: content = [i.text.strip() for i in soup.find_all('p')]

with:

div_list = [div for div in soup.find_all('div', class_="content-inner")]
p_list = [div.find_all('p') for div in div_list]
content = [item.text.strip() for p in p_list for item in p]

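Leave the rest of the code unchanged. This way, your script returns a list with everything you asked for (including the images), excluding the ads and the copyright string.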
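The same scoping idea can be applied to the images as well. The sketch below is only an illustration: the helper name scrape_container, the sample HTML, and the "html.parser" backend are assumptions, not part of the original code. It looks up the container once, then collects the <p> text and the <img> sources inside it. Passing src=True to find_all skips <img> tags that have no src attribute, which would otherwise raise a KeyError on img["src"].

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the HTML shown in the question.
html = """
<div class="content-inner ">
  <div class="addtoany_share_save_cont">
    <p>First paragraph.</p>
    <script></script>
    <img src="image.png">
    <img alt="no src attribute here">
    <p>Second paragraph.</p>
  </div>
</div>
<div>
  <p>Copyright © 2021. All Rights Reserved.</p>
  <img src="footer-logo.png">
</div>
"""


def scrape_container(soup, css_class="content-inner"):
    """Hypothetical helper: return (paragraph texts, image URLs) found
    inside the first <div> carrying the given class."""
    container = soup.find("div", class_=css_class)
    if container is None:
        # The class was not found on this page, so there is nothing to collect.
        return [], []
    paragraphs = [p.get_text().strip() for p in container.find_all("p")]
    # src=True matches only <img> tags that actually have a src attribute.
    images = [img["src"] for img in container.find_all("img", src=True)]
    return paragraphs, images


soup = BeautifulSoup(html, "html.parser")
paragraphs, images = scrape_container(soup)

print(paragraphs)
print(images)
```

Because the search starts from the container rather than from the whole document, the copyright paragraph and the footer image never enter the results, so no text-based filtering is needed.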

Tags: python, django, beautifulsoup, flask, python-beautifultable