Omitting specific text using BeautifulSoup

Question

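Using BeautifulSoup, I am trying to extract some very specific text from a website with a custom lambda function. I'm struggling to pick out what I need while leaving out what I don't.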

<div class="article__content">
      
            <h3 class="article__headline">
                <span class="article__label barrons">
Barron&#x27;s                    </span>
                
                    <a class="link" href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
                        
                        
                        More Bad Times Ahead for These 6 Big Tech Stocks
                    </a>
            </h3>
        

        
        <div class="article__details">
            <span class="article__timestamp" data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>

                
            
        </div>
    </div>

</div>

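I only want to extract the news headline, in this case "More Bad Times Ahead for These 6 Big Tech Stocks", and leave out the annoying label "Barron's".

So far my function looks like: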

for txt in soup.find_all(lambda tag: tag.name == 'h3' and tag.get('class') == ['article__headline']):
     print(txt.text)

I have tried tag.name == 'a' and tag.get('class') == ['link'], but that returns loads of other content from the page that I don't need...

Answer 1

Try the CSS selector h3 a (it selects all <a> tags that are inside an <h3> tag):

for title in soup.select("h3 a"):
    print(title.text.strip())

Prints:

More Bad Times Ahead for These 6 Big Tech Stocks

If you want to be more specific:

for title in soup.select("h3.article__headline a"):
    print(title.text.strip())
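
For anyone who wants to try this without fetching the page, here is a minimal self-contained sketch. It assumes a trimmed copy of the markup from the question; the href values and the extra unrelated link are placeholders added for illustration, and `is_headline_link` is a hypothetical helper name. It runs the more specific selector, then a comparable filter in the `find_all` style the question started with. Note that the function version checks the direct parent only, whereas the selector matches any descendant of the `<h3>`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. The href values and the
# trailing unrelated link are placeholders, not taken from the real page.
html = """
<div class="article__content">
    <h3 class="article__headline">
        <span class="article__label barrons">Barron&#x27;s</span>
        <a class="link" href="https://example.com/article">
            More Bad Times Ahead for These 6 Big Tech Stocks
        </a>
    </h3>
    <div class="article__details">
        <span class="article__timestamp">May. 10, 2022 at 11:39 a.m. ET</span>
    </div>
</div>
<a class="link" href="https://example.com/other">Unrelated link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: only <a> tags nested inside <h3 class="article__headline">.
titles = [a.get_text(strip=True) for a in soup.select("h3.article__headline a")]


def is_headline_link(tag):
    """Match <a> tags whose direct parent is <h3 class="article__headline">."""
    parent = tag.parent
    return (
        tag.name == "a"
        and parent is not None
        and parent.name == "h3"
        and "article__headline" in parent.get("class", [])
    )


# Same result using a filter function passed to find_all.
lambda_titles = [a.get_text(strip=True) for a in soup.find_all(is_headline_link)]

for title in titles:
    print(title)
```

The reason `tag.get('class') == ['link']` on its own returns too much is that it matches every `<a class="link">` on the page; tying the match to the enclosing headline element is what narrows it down.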

python

beautifulsoup

web-scraping