What is the Unicode encoding error in the code below?

When I run the program shown below, I get a Unicode encoding error.

import bs4
import requests
from xhtml2pdf import pisa  # import python module
from xhtml2pdf.config.httpconfig import httpConfig

res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

sourceHtml = str(pf)
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    # skip SSL certificate verification for resources that xhtml2pdf fetches
    httpConfig.save_keys('nosslcheck', True)

    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")

    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding="utf-8")

    # close output file
    resultFile.close()

    # pisaStatus.err is truthy if errors occurred, falsy on success
    return pisaStatus.err

# Main program
if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)

The error is as follows:

self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)

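I am trying to download part of a website using xhtml2pdf. To do that, I scrape the site with bs4, store the result, and then use xhtml2pdf to save it as a PDF. Most of the time this works like a charm, but in this case it gives me the error above. The complete code is available on GitHub here.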

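xhtml2pdf uses ASCII encoding, and because my HTML contains non-ASCII characters, it raises the error. I don't know how to change the encoder in xhtml2pdf. Dropping the non-ASCII characters is not an option: if I ignore them, the image links break and the images do not appear in the PDF.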

Full traceback:

```
Traceback (most recent call last):
  File "test3.py", line 80, in <module>
    convertHtmlToPdf(sourceHtml, outputFilename)
  File "test3.py", line 68, in convertHtmlToPdf
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding= 'utf-8')
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 97, in pisaDocument
    encoding, context=context, xml_output=xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 59, in pisaStory
    pisaParser(src, context, default_css, xhtml, encoding, xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 759, in pisaParser
    pisaLoop(document, context)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 700, in pisaLoop
    pisaLoop(node, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  [Previous line repeated 2 more times]
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 514, in pisaLoop
    attr = pisaGetAttributes(context, node.tagName, node.attributes)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 124, in pisaGetAttributes
    nv = c.getFile(nv)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\context.py", line 818, in getFile
    return getFile(name, relative or self.pathDirectory)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 738, in getFile
    file = pisaFileObject(*a, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 644, in __init__
    conn.request("GET", path)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)
```

The problem is that the retrieved HTML contains img tags, and some of their src attributes are URLs containing the character '\u2019' (RIGHT SINGLE QUOTATION MARK).
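If you want to confirm which image URLs are affected before changing anything, a quick check along these lines may help. This is only a sketch using the standard library's `html.parser`; the class `ImgSrcCollector`, the function `non_ascii_sources`, and the sample markup are made up for illustration and are not part of the original program.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def non_ascii_sources(html):
    """Return the img URLs that contain at least one non-ASCII character."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [src for src in collector.sources
            if any(ord(ch) > 127 for ch in src)]


# Made-up sample markup; the second URL contains U+2019.
sample = (
    '<div><img src="https://example.com/a/plain.jpg">'
    '<img src="https://example.com/a/India\u2019s-map.jpg"></div>'
)
print(non_ascii_sources(sample))
```

In the real program you could pass `str(pf)` instead of `sample`; any URL it reports is a candidate for the escaping step.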

xhtml2pdf passes these URLs to Python's http.client module without escaping them first. http.client tries to encode the URL as ASCII before retrieving it, and that is where the error occurs.
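The failure can be reproduced without xhtml2pdf or any network access. The snippet below mimics, roughly, what the last frame of the traceback does: it builds a request line and encodes it as ASCII. The path is a made-up example, not one of the real URLs from the page.

```python
# Hypothetical request path containing U+2019 (RIGHT SINGLE QUOTATION MARK)
path = "/wp-content/uploads/India\u2019s-map.jpg"
request_line = "GET %s HTTP/1.1" % path

error = None
try:
    # Same kind of call as request.encode('ascii') in the traceback
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

So the `encoding="utf-8"` argument passed to `pisa.CreatePDF` is not involved; the exception comes from the URL itself being sent unescaped.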

This can be fixed by escaping the URLs in the retrieved HTML before generating the PDF.

urllib.parse provides the tools to do this.

from urllib import parse
...
res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

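# only consider img tags that actually have a src attribute
imgs = pf.find_all('img', src=True)
for img in imgs:
    url = img['src']
    scheme, netloc, path, params, query, fragment = parse.urlparse(url)
    # percent-encode non-ASCII characters in the path, e.g. '\u2019' -> '%E2%80%99'
    new_path = parse.quote(path)
    new_url = parse.urlunparse((scheme, netloc, new_path, params, query, fragment))
    img['src'] = new_url

sourceHtml = str(pf)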
outputFilename = "test.pdf"
...
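The escaping step can also be pulled out into a small function so it is easy to test on its own. This is a sketch rather than a drop-in replacement: `escape_url_path` is a hypothetical name, the example URL is invented, and it only escapes the path component (a non-ASCII query string or host name would need separate handling). It also passes `safe="/%"` so that a path that is already percent-encoded is not encoded a second time, which is an assumption about what you want.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url_path(url):
    """Percent-encode the path component of url, leaving the rest untouched."""
    parts = urlsplit(url)
    # Keep '/' as a separator and '%' so existing escapes are not double-encoded
    new_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=new_path))


# Invented example URL containing U+2019 and a space in the path
example = "https://example.com/a/India\u2019s map.jpg?x=1"
print(escape_url_path(example))
```

Inside the loop you would then write `img['src'] = escape_url_path(img['src'])`. For the example above, the quotation mark becomes `%E2%80%99` and the space becomes `%20`.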

The answers to this question provide some background on Unicode and URLs.