Extra newline characters in the children of a Beautiful Soup element
I am using BeautifulSoup on an HTML fragment as follows:
s = """<div class="views-row views-row-1 views-row-odd views-row- first">
<span class="views-field views-field-title">
<span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
</span>
</span>
<span class="views-field views-field-created">
<span class="field-content">Friday, March 20, 2015
</span>
</span>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
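Why does s.span return only the first span tag?
Also, s.contents returns a list of length 4. Both span tags are in the list, but the items at index 0 and 2 are "\n" newline characters. The newline characters are useless. Is there a reason this is done?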
Why does s.span only return the first span tag?
s.span is a shortcut for s.find('span'), which finds only the first occurrence of a span tag.
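A minimal sketch of that equivalence, assuming Python 3 with the bs4 package installed and the built-in "html.parser" backend. The two-span markup and the id values are hypothetical stand-ins, not the fragment from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical two-span fragment, much shorter than the question's markup.
html = '<div><span id="a">first</span><span id="b">second</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access (soup.span) and find('span') both return the first
# <span> in document order, and it is the very same Tag object.
first_by_attr = soup.span
first_by_find = soup.find("span")
same_tag = first_by_attr is first_by_find

# find_all('span') returns every match instead of only the first one.
all_ids = [tag["id"] for tag in soup.find_all("span")]

print(first_by_attr["id"], same_tag, all_ids)
```

To reach the second span, iterate over find_all('span') rather than relying on the attribute shortcut.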
Moreover, s.contents returns a list of length 4. Both span tags are in the list but the 0th and 2nd index are "\n" new line characters. The new line character is useless. Is there a reason why this is done?
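By definition, .contents outputs a list of all of an element's children, including text nodes, which are instances of the NavigableString class. The newlines between the tags in the source are such text nodes, so they show up in the list.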
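To make the whitespace nodes concrete, here is a small sketch, again assuming Python 3, bs4, and the "html.parser" backend. The markup is an illustrative stand-in. It inspects a div's .contents and then keeps only the children that are Tag instances.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative fragment: the line breaks between tags become text nodes.
html = """<div>
<span>one</span>
<span>two</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents holds every direct child, text nodes included.
kinds = [type(child).__name__ for child in div.contents]

# Whitespace-only text nodes are the "extra" newlines from the question.
whitespace_nodes = [
    child for child in div.contents
    if isinstance(child, NavigableString) and not child.strip()
]

# Keeping only Tag instances drops the text nodes.
tag_children = [child for child in div.contents if isinstance(child, Tag)]
tag_texts = [tag.get_text() for tag in tag_children]

print(kinds)
print(len(whitespace_nodes), tag_texts)
```

Filtering on isinstance(child, Tag) keeps the parse tree untouched, which matters if the whitespace is significant elsewhere in the document.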
If you only need the tags, you can use find_all():
soup.find_all()
And, if you only want the span tags:
soup.find_all('span')
Example:
>>> from bs4 import BeautifulSoup
>>> s = """<div class="views-row views-row-1 views-row-odd views-row- first">
... <span class="views-field views-field-title">
... <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
... </span>
... </span>
... <span class="views-field views-field-created">
... <span class="field-content">Friday, March 20, 2015
... </span>
... </span>
... </div>"""
>>> soup = BeautifulSoup(s, "html.parser")
>>> for span in soup.find_all('span'):
...     print(span.text.strip())
...
Love Heals
Love Heals
Friday, March 20, 2015
Friday, March 20, 2015
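The duplicates appear because the span elements are nested: each outer span contains an inner span with the same text. You can fix this in different ways. For example, you can search only directly inside the div by using recursive=False: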
>>> for span in soup.find('div', class_='views-row-1').find_all('span', recursive=False):
...     print(span.text.strip())
...
Love Heals
Friday, March 20, 2015
Or, you can use CSS Selectors:
>>> for span in soup.select('div.views-row-1 > span'):
...     print(span.text.strip())
...
Love Heals
Friday, March 20, 2015