How to get rid of whitespace above text, using bs4

Question

I'm using bs4 (BeautifulSoup) to parse a website and find the specific headlines I'm looking for. My code looks like this:

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())

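This code works, but before printing the requested headlines from the site, it first prints 20 blank lines in the output. Is there something wrong with my code? Or is there something I can do to get rid of the blank lines?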

Answer 1

Because you have elements like this:

<article class="article-short">
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div>
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>

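Here the first link contains an image but no text, so i.a.text is an empty string for that element and print() outputs a blank line.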
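
To see this in isolation, here is a small sketch that parses a made-up snippet shaped like the element above (the URL, image, and headline text are placeholders, not taken from the real page) and compares the text of the first link with the text of the h6 tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure of the real element:
# a thumbnail link (image only) followed by the headline link.
html = """
<article class="article-short">
<div class="thumb"><a href="http://ewn.co.za/example"><img alt="thumbnail" src="thumb.jpg"/></a></div>
<h6 class="h6-mega"><a href="http://ewn.co.za/example">Example headline</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# article.a is the FIRST <a> inside the article, i.e. the thumbnail link.
first_link_text = article.a.text
# The headline lives in the <h6>, not in the first link.
headline_text = article.h6.get_text(strip=True)

print(repr(first_link_text))
print(repr(headline_text))
```

The first value is empty because an img tag carries no text of its own, which is where the blank lines come from.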

You should probably look for the h6 tag instead. So, something like this works:

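import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
for i in soup.find_all(class_='article-short'):
    # Prefer the <h6> headline; otherwise fall back to the element's first child.
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:  # skip entries that are empty after stripping
        print(title)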
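
If you want to try the idea without hitting the network, the same filtering can be sketched as a small function that works on an HTML string. The name extract_titles and the sample markup are made up for illustration; this variant simply skips articles that have no h6 instead of falling back to contents[0]:

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty <h6> headlines of all 'article-short' elements."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all(class_="article-short"):
        heading = article.find("h6")
        if heading is None:
            continue  # no headline tag in this article
        # get_text(strip=True) trims surrounding whitespace and newlines.
        title = heading.get_text(" ", strip=True)
        if title:  # ignore headlines that are empty after stripping
            titles.append(title)
    return titles


# Hypothetical sample: two real headlines, one image-only article, one empty <h6>.
sample = """
<article class="article-short">
  <div class="thumb"><a href="#"><img src="a.jpg"/></a></div>
  <h6 class="h6-mega"><a href="#">
    First headline
  </a></h6>
</article>
<article class="article-short"><a href="#"><img src="b.jpg"/></a></article>
<article class="article-short"><h6 class="h6-mega"><a href="#">   </a></h6></article>
<article class="article-short"><h6 class="h6-mega"><a href="#">Second headline</a></h6></article>
"""

titles = extract_titles(sample)
for t in titles:
    print(t)
```

With the live page you would pass requests.get(url).text to the function instead of the sample string.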

python

parsing

python-3.x

python-requests

bs4