Why is the HTML content from .txt empty?
The goal of the program is simply to fetch the headlines from tagesschau.de.
It worked at first, but after a few runs it returned nothing.
import requests
from bs4 import BeautifulSoup

headers = {
    # Adjacent string literals are concatenated, so each part ends with a space
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/86.0.4240.111 Safari/537.36',
    'Host': 'www.tagesschau.de',
    'Referer': 'https://www.tagesschau.de/'
}

# get and parse the HTML of tagesschau.de
URL = 'https://www.tagesschau.de/'
html = requests.get(URL, headers=headers)
html_parse = BeautifulSoup(html.content, 'lxml')

# find all headlines on the homepage
elements = html_parse.find_all('h4', {'class': 'headline'})
for element in elements:
    print(element.txt)
I get nothing useful:
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
But when I print element instead of element.txt, I get some correct output:
<h4 class="headline"><a href="/multimedia/livestreams/livestream3/">Live: tagesschau24</a></h4>
<h4 class="headline"><a href="/100sekunden/">100 Sekunden</a></h4>
<h4 class="headline"><a href="/multimedia/sendung/ts-39833.html">tagesschau 20 Uhr</a></h4>
<h4 class="headline"><a href="/multimedia/sendung/ts-39841.html">Letzte Sendung</a></h4>
<h4 class="headline">++ Fauci warnt vor "einer Menge Leid" ++</h4>
<h4 class="headline">Weniger Party, mehr Wellness</h4>
<h4 class="headline">November-Lockdown kostet 19 Milliarden</h4>
This confuses me. Why?
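A BeautifulSoup tag has no .txt attribute. Accessing an unknown attribute on a tag is treated as a search for a child tag with that name, so element.txt looks for a <txt> tag inside the <h4>, finds none, and returns None. If you want the inner text of the element, use .text:

for element in elements:
    print(element.text)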
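For the inner HTML, .html does not work, for the same reason: it looks for a child <html> tag and returns None. Use .decode_contents() instead:

for element in elements:
    print(element.decode_contents())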
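
To see the difference between these attributes without hitting the network, here is a minimal sketch that parses a small inline snippet. The sample markup is copied from the output in the question; the names SAMPLE_HTML and describe are made up for illustration, and the built-in 'html.parser' is used instead of 'lxml' only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the output shown in the question.
SAMPLE_HTML = """
<h4 class="headline"><a href="/100sekunden/">100 Sekunden</a></h4>
<h4 class="headline">Weniger Party, mehr Wellness</h4>
"""

# 'html.parser' ships with Python; the question uses 'lxml', which also works.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
elements = soup.find_all('h4', {'class': 'headline'})


def describe(element):
    """Return (element.txt, inner text, inner HTML) for one tag."""
    # Unknown attribute names are treated as child-tag lookups, so
    # element.txt behaves like element.find('txt') and yields None here.
    missing = element.txt
    # get_text() returns the text inside the tag (the same text as .text).
    text = element.get_text(strip=True)
    # decode_contents() returns the markup inside the tag as a string.
    inner_html = element.decode_contents()
    return missing, text, inner_html


for element in elements:
    print(describe(element))
```

If the list of elements itself comes back empty after several runs, the response is worth inspecting too (for example html.status_code), since the attribute fix above only explains the None values.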