What is the Unicode encoding error in the code below?
When I run the program shown below, I get a Unicode encoding error.
import bs4
import requests
from xhtml2pdf import pisa  # import python module
from xhtml2pdf.config.httpconfig import httpConfig

res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")
sourceHtml = str(pf)
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    httpConfig.save_keys('nosslcheck', True)
    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")
    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding="utf-8")
    # close output file
    resultFile.close()
    # return the error count (0 on success)
    return pisaStatus.err

# Main program
if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)
The error is as follows:
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)
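I am trying to download part of a website as a PDF. I use bs4 to scrape the page and keep the part I need, then use xhtml2pdf to save it as a PDF.
Most of the time this works like a charm, but in this case it gives me the error above. The full code is available on GitHub.
xhtml2pdf seems to use ASCII encoding, and since my HTML contains non-ASCII characters, the error is raised. I don't know how to change the encoder used by xhtml2pdf. Dropping the non-ASCII characters is not an option: if I ignore them, the image links break and the images do not appear in the PDF.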
Full traceback:
```
Traceback (most recent call last):
  File "test3.py", line 80, in <module>
    convertHtmlToPdf(sourceHtml, outputFilename)
  File "test3.py", line 68, in convertHtmlToPdf
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding= 'utf-8')
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 97, in pisaDocument
    encoding, context=context, xml_output=xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 59, in pisaStory
    pisaParser(src, context, default_css, xhtml, encoding, xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 759, in pisaParser
    pisaLoop(document, context)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 700, in pisaLoop
    pisaLoop(node, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  [Previous line repeated 2 more times]
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 514, in pisaLoop
    attr = pisaGetAttributes(context, node.tagName, node.attributes)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 124, in pisaGetAttributes
    nv = c.getFile(nv)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\context.py", line 818, in getFile
    return getFile(name, relative or self.pathDirectory)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 738, in getFile
    file = pisaFileObject(*a, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 644, in __init__
    conn.request("GET", path)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)
```
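The problem is that the retrieved HTML contains img tags, and some of their src attributes are URLs containing the '\u2019' ('RIGHT SINGLE QUOTATION MARK') character.
xhtml2pdf passes these URLs to Python's http.client module without escaping them first. http.client tries to encode the URL as ASCII before retrieving it, and that is where the error is raised.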
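The failure can be reproduced without any network access. The following is only a minimal sketch: the image path is a made-up stand-in for one of the real src values, and the request line is built by hand to mirror the `request.encode('ascii')` call at the bottom of the traceback.

```python
from urllib import parse

# Hypothetical image path containing U+2019 (not taken from the real page).
path = "/wp-content/uploads/2018/06/india\u2019s-map.jpg"
request_line = "GET %s HTTP/1.1" % path

try:
    # Same kind of call that fails inside http.client's putrequest.
    request_line.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# Percent-encoding the path first gives a request line that is pure ASCII.
quoted = parse.quote(path)
safe_line = ("GET %s HTTP/1.1" % quoted).encode("ascii")
```

Here `parse.quote` encodes the character as UTF-8 and percent-escapes each byte, so the second `encode("ascii")` succeeds.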
This can be fixed by escaping the URLs in the retrieved HTML before generating the PDF.
urllib.parse provides the tools to do this.
from urllib import parse
...
res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")
imgs = pf.find_all('img')
for img in imgs:
    url = img['src']
    scheme, netloc, path, params, query, fragment = parse.urlparse(url)
    new_path = parse.quote(path)
    new_url = parse.urlunparse((scheme, netloc, new_path, params, query, fragment))
    img['src'] = new_url
sourceHtml = str(pf)
outputFilename = "test.pdf"
...
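If the same escaping is needed in several places, the loop body could be factored into a small helper. This is just one possible sketch, not part of xhtml2pdf: `quote_url_path` is a hypothetical name, and unlike the loop above it also treats `%` as safe, on the assumption that some src values may already be percent-encoded and should not be escaped a second time.

```python
from urllib import parse

def quote_url_path(url):
    """Percent-encode the path component of url (hypothetical helper)."""
    parts = parse.urlsplit(url)
    # Keeping '%' safe leaves already-escaped sequences such as %20 untouched.
    new_path = parse.quote(parts.path, safe="/%")
    return parse.urlunsplit(parts._replace(path=new_path))

# example.com is a placeholder host used only for illustration.
escaped = quote_url_path("https://example.com/img/india\u2019s map.jpg?x=1")
```

In the scraping code this would be used as `img['src'] = quote_url_path(img['src'])`. Because `%` is left alone, applying the helper twice gives the same result as applying it once.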
The answers to this question provide some background on Unicode and URLs.