Is there a way to retrieve the HTML content of a web page by casting it into a string in Python?

Question

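I'm trying to retrieve the HTML content of a web page and read it as a string. However, whenever I run my code I get a bytes-like object instead of a string, and decode() doesn't seem to work in this case.

My code is as follows: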

import urllib.request

money_request = urllib.request.urlopen('website-url-here').read()

print(money_request.decode('utf-8'))

Running the script above produces the following error:

Traceback (most recent call last):
  File "E:\University Stuff\Licenta\gas_station_service.py", line 12, in <module>
    print(money_request.decode())
  File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u02bb' in position 143288: character maps to <undefined>
>>>

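I'd also like to point out that I have checked that the website uses UTF-8 encoding, using the Chrome console and the command document.characterSet.

I need to retrieve the page as a string so that I can search through its lines and get a value from a span tag.

Any help is appreciated.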

Answer 1

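You might be better off using Beautiful Soup, since it helps with parsing the HTML. If you don't have the module, install it with pip install bs4 on Windows, or pip3 install bs4 on Mac or Linux. requests is a third-party package as well, so if it isn't already present in your Python 3 environment, install it with pip too. If you don't have the lxml module, go ahead and install it with pip install lxml.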

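import requests
from bs4 import BeautifulSoup

# Download the page; res.content holds the raw bytes of the response body
res = requests.get('website-url-here')
src = res.content
# Parse the bytes with the lxml parser
soup = BeautifulSoup(src, 'lxml')
# prettify() returns the parsed document as an indented str
markup = soup.prettify()
print(markup)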

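You'll get the whole scraped page, and from there it should be easy to extract the useful part by finding the content you want:

soup.find_all('div', {'class': 'classname'})

This returns a list of every matching element, whereas this does not:

soup.find('div', {'class': 'classname'})

It returns only the first element that matches your selection.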
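The difference between "every match" and "first match" can also be illustrated without any third-party package. The following is only a minimal sketch that uses the standard library's html.parser instead of Beautiful Soup; the SpanTextCollector class, the sample HTML, and the class name "price" are all made up for illustration and do not come from the question's page.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside every <span> that carries a given CSS class.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching span
        self.values = []    # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A span nested inside a span we are already collecting
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.values[-1] += data


# Made-up markup standing in for the downloaded page (already a str)
sample_html = (
    '<div><span class="price">1.45</span>'
    '<span class="unit">EUR/l</span>'
    '<span class="price big">1.52</span></div>'
)

collector = SpanTextCollector("price")
collector.feed(sample_html)

all_values = collector.values                           # analogous to find_all
first_value = all_values[0] if all_values else None     # analogous to find

print(all_values)
print(first_value)
```

With Beautiful Soup installed, soup.find_all('span', {'class': 'price'}) and soup.find('span', {'class': 'price'}) would play the same two roles, returning elements rather than plain strings.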

Answer 2

You can simply use the text attribute to get the website's HTML as a string:

import requests
response = requests.get('website-url-here')
print(response.text)
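One detail in the question's traceback is worth noting: the error is a UnicodeEncodeError raised from cp1252.py in encode, which suggests that decode() itself succeeded and that the failure happened when print tried to write the character '\u02bb' to a console using cp1252. The sketch below reproduces that situation offline under stated assumptions: the sample bytes stand in for the downloaded page, and the write_like_console helper is a hypothetical stand-in for a console, built on an in-memory buffer.

```python
import io


def write_like_console(text, encoding, errors="strict"):
    """Print text to an in-memory 'console' and return the bytes written.

    Hypothetical helper: it imitates sys.stdout with a chosen encoding.
    """
    buffer = io.BytesIO()
    console = io.TextIOWrapper(buffer, encoding=encoding, errors=errors, newline="")
    print(text, file=console)
    console.flush()
    return buffer.getvalue()


# Stand-in for urlopen(...).read(): UTF-8 bytes containing U+02BB
raw = "<span>Hawai\u02bbi</span>".encode("utf-8")

# Decoding works and yields a real str
html_text = raw.decode("utf-8")
print(type(html_text).__name__)

# Printing to a cp1252 console fails, as in the question's traceback
try:
    write_like_console(html_text, "cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Two possible ways out: replace unencodable characters, or use UTF-8 output
replaced = write_like_console(html_text, "cp1252", errors="replace")
as_utf8 = write_like_console(html_text, "utf-8")
```

If this is indeed the cause, the string is already usable for searching, and only printing needs care. On Python 3.7 or later, calling sys.stdout.reconfigure(encoding="utf-8") before printing, or setting the PYTHONIOENCODING environment variable to utf-8, are two options that may help, depending on what the terminal can display.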

python

urllib

web-scraping