How to get rid of whitespace above text, using bs4

OK, so I'm using bs4 (BeautifulSoup) to parse a website and find the specific headlines I'm looking for. My code looks like this:

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())

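This code works, but before printing the requested headlines from the site, it first outputs 20 blank lines. Is something wrong with my code, or is there something I can do to get rid of the blank lines?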

That's because the page contains elements like this:

<article class="article-short">
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div>
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>

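Here the first link contains an image but no text. Since i.a returns the first <a> inside the article, i.a.text is an empty string for these elements, and print() outputs a blank line.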
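A minimal offline sketch of this behaviour, using a trimmed copy of the markup above. The href, src and alt values are shortened placeholders rather than the real ones, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article markup; href/src/alt values are placeholders.
html = """
<article class="article-short">
<div class="thumb"><a href="#"><img alt="thumbnail" src="#"/></a></div>
<h6 class="h6-mega"><a href="#">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# article.a is the first <a> descendant, which wraps only the <img>.
first_link_text = article.a.text.strip()
# article.h6 is the heading that holds the actual headline text.
heading_text = article.h6.text.strip()

print(repr(first_link_text))
print(repr(heading_text))
```

The first value printed is an empty string, which is what shows up as a blank line in the original output.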

You should probably look for the h6 tag instead. Something like this works:

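import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly
for i in soup.find_all(class_='article-short'):
    # Prefer the <h6> headline; otherwise fall back to the article's first child node
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:  # skip empty titles so no blank lines are printed
        print(title)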
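As an alternative sketch (not part of the answer above), a CSS selector can target the headings directly. The helper name extract_titles and the sample markup are made up for illustration, and this version drops the contents[0] fallback, so articles without an h6 are simply skipped:

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty <h6> headline texts inside .article-short elements."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for heading in soup.select(".article-short h6"):
        text = heading.get_text(strip=True)
        if text:  # skip headings that contain no text
            titles.append(text)
    return titles


# Made-up sample markup for illustration only.
sample = """
<article class="article-short">
<div class="thumb"><a href="#"><img src="#"/></a></div>
<h6 class="h6-mega"><a href="#">First headline</a></h6>
</article>
<article class="article-short"><h6 class="h6-mega"><a href="#"> </a></h6></article>
<article class="article-short"><h6><a href="#">Second headline</a></h6></article>
"""

print(extract_titles(sample))
```

To run it against the live page, the text returned by requests.get(url).text could be passed in place of sample, assuming the site still uses the same class names.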