Missing code when I decode bytes (HTML) in Python (requests, BeautifulSoup, urllib)

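I'm new to Python and I'm trying to fetch the source of a web page so that I can work with its HTML elements.

However, when I decode the bytes as UTF-8, some of the HTML disappears. Here is my code: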

import urllib.request

req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()

For example, in "the_page" the content of the DIV with ID "review_data" is:

\n\n\t\t\t\t\t\t\t\t\t\t<div id="review_data" class="track_links">\n\t\t\t\t\t\t\t\t\t\t\t\t<p><!--[lead]-->Los expertos en soluciones antivirus gratuitas conocen bien el Avast Free Antivirus 2016, y probablemente ya lo hayan instalado alguna vez. Este software es <strong>uno de los l\xc3\xadderes en su campo</strong>, proporcionando un s\xc3\xb3lido conjunto de defensas contra virus y malware, as\xc3\xad como algunas otras herramientas \xc3\xbatiles que ni se imagina. Mejor a\xc3\xban, <strong>Avast es uno de los antivirus menos intrusivos</strong>, quiz\xc3\xa1 no tanto en los \xc3\xbaltimos a\xc3\xb1os, pero sigue siendo un sistema mucho menos acaparador que los dos grandes antivirus.\r<br /><!--[/lead]--></p>\r<p><!--[features]--><!--[subfeatures]--><h3>Lleno de caracter\xc3\xadsticas.</h3><!--[/subfeatures]--></p>\r<p>Una gran ventaja del Avast Free Antivirus 2016 es su conjunto de caracter\xc3\xadsticas. Aunque estas caracter\xc3\xadsticas han provocado que el tama\xc3\xb1o de instalaci\xc3\xb3n sea mayor (se recomienda hasta 2 GB de espacio de disco duro disponible), no deber\xc3\xada resultar un problema para la mayor\xc3\xada de los discos duros modernos, adem\xc3\xa1s incluye gran cantidad de herramientas de forma gratuita. Aparte de la exploraci\xc3\xb3n antivirus est\xc3\xa1ndar, que se mantiene firme con<strong> actualizaciones peri\xc3\xb3dicas</strong>, la \xc3\xbaltima versi\xc3\xb3n de Avast tiene la seguridad de red dom\xc3\xa9stica que detecta vulnerabilidades para todos los dispositivos conectados a la red. <strong>La \xc3\xbaltima versi\xc3\xb3n, la actualizaci\xc3\xb3n \'Nitro\', tambi\xc3\xa9n a\xc3\xb1ade un navegador dedicado llamado Avast SafeZone</strong>. Aclamado como el navegador m\xc3\xa1s seguro del mundo, es a la vez un software inflado con car\xc3\xa1cter gratuito. Para aquellos a los que les importa la seguridad, especialmente en lo que se refiere a cuestiones bancarias, el programa resulta ser una bendici\xc3\xb3n. 
El <strong>bloqueador de anuncios incorporado</strong> puede ser un regalo del cielo a la hora de visitar ciertos sitios. Otra nueva caracter\xc3\xadstica es Cybercapture, lo que pone en cuarentena los archivos entrantes sospechosos. Las v\xc3\xadctimas de los virus sabr\xc3\xa1n la importancia de este buffer.\r<br /><!--[/features]--></p>\r<p><!--[usability]--><!--[subusability]--><h3>Una interfaz sencilla y eficaz</h3><!--[/subusability]--></p>\r<p>Avast ha cambiado varias veces a lo largo de los a\xc3\xb1os y la actualizaci\xc3\xb3n Nitro no es una excepci\xc3\xb3n, pero por suerte su dise\xc3\xb1o parece haber permanecido constante. El programa es <strong>simple y f\xc3\xa1cil de usar, con botones definidos y textos claros</strong> en colores agradables. Avast Free Antivirus 2016 se asentar\xc3\xa1 en la bandeja del sistema hasta que se necesite, al igual que la mayor\xc3\xada del software antivirus, se expande cuando se abre en una ventana peque\xc3\xb1a sin fronteras con apariencia elegante y coincide con el esquema de dise\xc3\xb1o de Windows 10. La mayor\xc3\xada de las secciones de este programa son bastante f\xc3\xa1ciles de seguir, con un gran conjunto de botones para las herramientas e iconos est\xc3\xa1ndar, como una rueda dentada para acceder a la configuraci\xc3\xb3n. Por supuesto, siempre puedes actualizar pulsando el bot\xc3\xb3n premium, anim\xc3\xa1ndole a descargar y pagar por Avast Premier. Sin embargo, esto no es obligatorio. Cada una de las principales caracter\xc3\xadsticas de Avast tiene su propia secci\xc3\xb3n, tales como la seguridad de Internet, el navegador SafeZone y la exploraci\xc3\xb3n inteligente, as\xc3\xad que realmente nada puede ir mal.\r<br /><!--[/usability]--></p>\r<p><!--[conclusion]--><!--[subconclusion]--><h3>Las mejores cosas de la vida son gratis</h3><!--[/subconclusion]--></p>\r<p>Para un programa gratuito, <strong>Avast es realmente excelente</strong>. 
S\xc3\xad, se ha perdido algo de su sensaci\xc3\xb3n m\xc3\xa1s independiente de ediciones pasadas, pero eso es solo un peque\xc3\xb1o precio para un software libre de estas caracter\xc3\xadsticas. Avast Free Antivirus 2016 es menos intrusivo en su navegaci\xc3\xb3n diaria y es muy sencillo de utilizar, por lo que sigue siendo una de las principales soluciones gratuitas.\r<br /><!--[/conclusion]--></p>\n\t\t\t\t\t\t\t\t\t\t\t</div>

But when I try any of the following:

import urllib.request

req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
html_missing_elements = the_page.decode('utf-8')

Or:

import requests

r = requests.get('http://avast.softonic.com/')
html_missing_elements = r.text

Or:

import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
html_missing_elements = BeautifulSoup(the_page)

In each of these cases, the DIV with ID "review_data" contains only:

<div id="review_data" class="track_links"><br /><!--[/conclusion]--></p></div>

I can't get the complete original HTML of the page; some of it is missing, and I'd like to know why.

Thanks.

There are carriage returns, i.e. \r, embedded in the HTML:

\r<br /><!--[/lead]--></p>\r
>\r<p>A big plus point for Avast Free Antivirus 2016

And there are many more.

Once you remove them, everything works as expected in your IDE and you can see the tag's content when printing:

# Strip the carriage returns from the raw bytes before parsing.
soup = BeautifulSoup(r.content.replace(b"\r", b""), "lxml")
print(soup.select_one("#review_data"))
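To check that decoding itself loses nothing, here is a minimal standard-library sketch. The `sample` bytes are a made-up stand-in for the downloaded page (not the real response), and `strip_carriage_returns` is a hypothetical helper name used only for illustration:

```python
# Hypothetical sample standing in for the downloaded page bytes.
sample = (
    b'<div id="review_data">\n'
    b"<p>l\xc3\xadderes en su campo\r<br /></p>\r"
    b"<p>m\xc3\xa1s texto\r<br /></p>\n"
    b"</div>"
)


def strip_carriage_returns(raw: bytes) -> bytes:
    """Return the payload with every carriage-return byte removed."""
    return raw.replace(b"\r", b"")


decoded = sample.decode("utf-8")
cleaned = strip_carriage_returns(sample).decode("utf-8")

# The carriage returns are present in the raw bytes...
print(sample.count(b"\r"))  # 3

# ...and the text survives decoding; only its display is affected.
print("l\u00edderes" in decoded)  # True

# repr() escapes control characters instead of sending them to the console.
print(repr(decoded))
```

Printing `repr(decoded)` rather than `decoded` is a quick way to confirm the content is all there, because the `\r` characters appear as visible escapes.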

The data is definitely there; your IDE just doesn't display it because of the carriage returns:

 soup = BeautifulSoup(r.content,"lxml")
 print(soup.select_one("#review_data"))

In PyCharm, this outputs:

<div class="track_links" id="review_data">
<br/><!--[/conclusion]--></p>
</div>

But using:

 print(soup.select_one("#review_data").text)

outputs:

\nConnoisseurs of free antivirus solutions will already know of Avast Free Antivirus 2016 and have probably installed it at some point or another. This software is one of the leaders in its field, providing a robust suite of defences against viruses and malware, as well as some other useful tools that you might not expect. Better still, Avast is one of the less intrusive antivirus programs- perhaps less so in recent years, but still a lot less system-hogging than the big two.\r Brimming with features A big plus point for Avast Free Antivirus 2016 is its suite of features. Although these features have caused its install size to increase (up to 2GB hard drive space is recommended!), it shouldn’t prove an issue for most modern hard drives and you do get a lot of tools for free. Aside from the standard antivirus scanning, which is kept sharp with constant updates, the latest version of Avast has home network security which detects vulnerabilities for all devices connected to your network. The latest version, the ‘Nitro’ update, also adds a dedicated Avast browser called SafeZone. Heralded as the world’s safest browser, this could equally be argued as bloatware and a great free feature. For those who are security conscious, especially regarding banking, it should be seen as beneficial. The in-built ad blocker can be a godsend when visiting certain sites. Another new feature is CyberCapture, which quarantines any suspicious incoming files. Victims of viruses will know the importance of this buffer.\r A simple and effective interface Avast has changed a few times over the years and the Nitro update is no different, but thankfully their design approach seems to have remained constant. The program is simple and straightforward to use, with bold buttons and clear text in friendly colours. 
Avast Free Antivirus 2016 will sit in the system tray until needed, like most antivirus software, then expand when opened into a small borderless window that looks sleek matching the Windows 10 design scheme. Most sections of this are easy enough to follow, with a large set of buttons for the tools and standard icons like a cog for accessing settings. Of course, you’re also never far away from a premium upgrade button, encouraging you to download and pay for Avast Premier. However, this is not forced upon you. Each of the main features of Avast has its own section, such as internet security, the SafeZone browser and Smart Scan, so you really can’t go wrong.\r The best things in life are free For a free program, Avast is pretty impressive. Yes, it has lost some of its independent feel as the years have gone by, but that’s a small price for a great bit of free software. Avast Free Antivirus 2016 will interfere with your everyday browsing less than the bigger names in software. It’s very simple to use, therefore remains one of the top free solutions.\r\n'

If you run the same code in IPython, you will see the correct output using just soup = BeautifulSoup(r.content,"lxml"):

In [5]: soup = BeautifulSoup(r.content,"lxml")

In [6]: soup.select_one("#review_data")
Out[6]: 
<div class="track_links" id="review_data">
<p><!--[lead]-->Connoisseurs of free antivirus solutions will already know of Avast Free Antivirus 2016 and have probably installed it at some point or another. This software is one of the leaders in its field, providing a <strong>robust suite of defences against viruses and malware</strong>, as well as some other useful tools that you might not expect. Better still, Avast is one of the less intrusive antivirus `
<br/><!--[/lead]--></p> <p><!--[features]--><!--[subfeatures]--></p><h3>Brimming with features</h3><!--[/subfeatures]--> <p>A big plus point for Avast Free Antivirus 2016 is its suite of features. Although these features have caused its install size to increase (up to 2GB hard drive space is recommended!), it shouldn’t prove an issue for most modern hard drives and you do get a lot of tools for free.</p> <p>Aside from the standard antivirus scanning, which is kept sharp with constant updates, the latest version of Avast has <strong>home network security</strong> which detects vulnerabilities for all devices connected to your network.</p> <p>The latest version, the ‘Nitro’ update, also adds a dedicated Avast browser called <strong>SafeZone</strong>. Heralded as the world’s safest browser, this could equally be argued as bloatware and a great free feature. For those who are security conscious, especially regarding banking, it should be seen as beneficial. The in-built ad blocker can be a godsend when visiting certain sites. Another new feature is <strong>CyberCapture</strong>, which quarantines any suspicious incoming files. Victims of viruses will know the importance of this buffer.
<br/><!--[/features]--></p> <p><!--[usability]--><!--[subusability]--></p><h3>A simple and effective interface</h3><!--[/subusability]--> <p>Avast has changed a few times over the years and the <strong>Nitro update</strong> is no different, but thankfully their design approach seems to have remained constant. The program is <strong>simple and straightforward</strong> to use, with bold buttons and clear text in friendly colours.</p> <p>Avast Free Antivirus 2016 will sit in the system tray until needed, like most antivirus software, then expand when opened into a small borderless window that looks sleek matching the Windows 10 design scheme. Most sections of this are easy enough to follow, with a large set of buttons for the tools and standard icons like a cog for accessing settings.</p> <p>Of course, you’re also never far away from a premium upgrade button, encouraging you to download and pay for <a href="http://avast-premier-antivirus.en.softonic.com" title="Avast Premier">Avast Premier</a>. However, this is not forced upon you.</p> <p>Each of the main features of Avast has its own section, such as <strong>internet security</strong>, the SafeZone browser and <strong>Smart Scan</strong>, so you really can’t go wrong.
<br/><!--[/usability]--></p> <p><!--[conclusion]--><!--[subconclusion]--></p><h3>The best things in life are free</h3><!--[/subconclusion]--> <p>For a free program, Avast is pretty impressive. Yes, it has lost some of its independent feel as the years have gone by, but that’s a small price for a great bit of free software. Avast Free Antivirus 2016 will interfere with your everyday browsing less than the bigger names in software. It’s very simple to use, therefore remains <strong>one of the top free solutions</strong>.
<br/><!--[/conclusion]--></p>
</div>

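It has nothing to do with encoding; the carriage returns simply interfere with the output of whatever you run the code from. Here is a simple example with another control character, the backspace \b, that shows how the output can be affected: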

In [14]: s = "foo\bar"

In [15]: print(s)
foar
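The same effect can be reproduced without a console. The sketch below is a rough approximation, not how any particular IDE is implemented: `render_like_terminal` is a hypothetical helper that assumes a plain terminal which moves the cursor on carriage return and backspace and overwrites characters in place. Some consoles (PyCharm's appears to be one, judging by the output above) clear the line on a carriage return instead, which hides even more text.

```python
def render_like_terminal(text: str) -> str:
    """Approximate what an overwrite-in-place terminal would display.

    Carriage return moves the cursor to column 0, backspace moves it
    one column left, and any other character overwrites the cell
    under the cursor.
    """
    rendered_lines = []
    for raw_line in text.split("\n"):
        cells = []
        cursor = 0
        for ch in raw_line:
            if ch == "\r":
                cursor = 0
            elif ch == "\b":
                cursor = max(cursor - 1, 0)
            else:
                if cursor < len(cells):
                    cells[cursor] = ch  # overwrite an existing cell
                else:
                    cells.append(ch)
                cursor += 1
        rendered_lines.append("".join(cells))
    return "\n".join(rendered_lines)


s = "foo\bar"
print(repr(s))                  # 'foo\x08ar'
print(render_like_terminal(s))  # foar

# A carriage return lets later text overwrite the start of the line.
print(render_like_terminal("first paragraph\r<br/>"))  # <br/> paragraph
```

The string itself never changes; only the rendering does, which is why `repr()` or stripping the `\r` characters makes the "missing" HTML visible again.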