Extra newline character for children of Beautiful Soup

Question

I am using BeautifulSoup on a fragment of HTML as follows:

from bs4 import BeautifulSoup

s = """<div class="views-row views-row-1 views-row-odd views-row-first">
            <span class="views-field views-field-title"> 
                <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
                </span> 
            </span>
            <span class="views-field views-field-created"> 
                <span class="field-content">Friday, March 20, 2015
                </span> 
           </span> 
</div>""" 

soup = BeautifulSoup(s, "html.parser")

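Why does s.span only return the first span tag?

Moreover, s.contents returns a list of length 4. Both span tags are in the list, but the 0th and 2nd indices are "\n" newline characters. The newline characters are useless. Is there a reason why this is done?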

Answer 1

Why does s.span only return the first span tag?

s.span is a shortcut for s.find('span'), which finds only the first occurrence of a span tag.
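
A minimal sketch of that equivalence, assuming Python 3, the bs4 package, and the built-in html.parser. The shortened markup and the variable names are illustrative and do not come from the question:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup with two sibling span tags.
html = '<div><span id="a">first</span><span id="b">second</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access is shorthand for find(): both return the same first match.
first_by_attr = soup.span
first_by_find = soup.find("span")
same_tag = first_by_attr is first_by_find

# find_all() returns every match instead of only the first one.
all_ids = [span["id"] for span in soup.find_all("span")]

print(same_tag, first_by_attr["id"], all_ids)
```

The second span is never reached through soup.span; getting it requires find_all() or navigation from the first match.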

Moreover s.contents returns a list of length 4. Both span tags are in the list but the 0th and 2nd index are "\n" new line characters. The new line character is useless. Is there a reason why this is done?

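By definition, .contents outputs a list of all of an element's children, including text nodes, which are instances of the NavigableString class. The whitespace and line breaks between tags are part of the document, so they show up as text nodes too.

If you want only the tags, you can use find_all():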
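
To see where the "\n" entries come from, the sketch below lists the type of each child and then keeps only the tags. It assumes Python 3 and the built-in html.parser; the shortened markup and the variable names are illustrative:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative markup: line breaks between tags become text nodes.
html = "<div>\n<span>one</span>\n<span>two</span>\n</div>"
div = BeautifulSoup(html, "html.parser").div

# .contents holds every child, whitespace-only strings included.
kinds = [type(child).__name__ for child in div.contents]

# Keep only the element children and drop the text nodes.
tags_only = [child for child in div.contents if isinstance(child, Tag)]

print(kinds)
print([tag.text for tag in tags_only])
```

Filtering with isinstance() is one option among several; it leaves the parsed tree untouched and simply ignores the text nodes when iterating.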

soup.find_all()

And if you want only the span tags:

soup.find_all('span')

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<div class="views-row views-row-1 views-row-odd views-row-first">
...             <span class="views-field views-field-title"> 
...                 <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
...                 </span> 
...             </span>
...             <span class="views-field views-field-created"> 
...                 <span class="field-content">Friday, March 20, 2015
...                 </span> 
...            </span> 
... </div>""" 
>>> soup = BeautifulSoup(s, "html.parser")
>>> for span in soup.find_all('span'):
...     print(span.text.strip())
... 
Love Heals
Love Heals
Friday, March 20, 2015
Friday, March 20, 2015

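The duplicates appear because the span elements are nested: each outer span contains the same text as its inner span. You can fix this in different ways. For example, you can search only the direct children of the div by passing recursive=False: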

>>> for span in soup.find('div', class_='views-row-1').find_all('span', recursive=False):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015

Or, you can use CSS selectors:

>>> for span in soup.select('div.views-row-1 > span'):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015
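
The > in that selector is the child combinator, which is what limits the match to the outer span tags. A small sketch of how it differs from the descendant combinator (a plain space), assuming Python 3, the built-in html.parser, and a bs4 version with select() support; the markup and variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup with one span nested inside another.
html = '<div class="row"><span><span>inner</span></span><span>flat</span></div>'
soup = BeautifulSoup(html, "html.parser")

# "div.row > span" matches only spans that are direct children of the div.
direct = soup.select("div.row > span")

# "div.row span" matches spans at any depth below the div.
nested = soup.select("div.row span")

print(len(direct), len(nested))
```

Here the child combinator finds two spans and the descendant combinator finds three, which mirrors the duplicate lines seen with a plain find_all('span').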
