Python

Question

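I use mechanize to browse a website. After that, I use BeautifulSoup to manipulate the page content (convert it to Unicode, delete some lines). Now I want to create a PDF file from the HTML source obtained with BeautifulSoup. I use pdfkit, which works for the text. But now I also want the PDF to include the pictures referenced in the HTML code. The URLs are specified with relative paths such as '../../' (this also applies to the pictures).

How can I change all the URLs to absolute paths, and how do I get the pictures into the PDF file? Is changing the paths enough to get the pictures?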

Solution (based on dudu1791's suggestion):

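from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Rewrite image links as absolute URLs
def ChangeLinkIMG(soup, baseurl):
    # Walk over every <img> tag
    for imgLK in soup.find_all('img'):
        # Relative image path; skip tags without a src attribute
        relaIMG = imgLK.get('src')
        if not relaIMG:
            continue
        # Build the absolute link and write it back into the tag
        absoIMG = urljoin(baseurl, relaIMG)
        imgLK['src'] = absoIMG
        print(absoIMG)
    return soup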
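
To illustrate what the urljoin call in the solution above does with paths like '../../', here is a minimal sketch using only the standard library. The page URL and image paths are made up for illustration and do not come from the original question.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration
base_url = "http://example.com/docs/guide/page.html"

# Hypothetical src values as they might appear in the page
paths = [
    "../../img/logo.png",            # climbs two levels from /docs/guide/
    "/static/pic.jpg",               # resolved against the site root
    "http://cdn.example.org/a.png",  # already absolute, returned unchanged
]

resolved = [urljoin(base_url, p) for p in paths]
for url in resolved:
    print(url)
```

Once every src has been rewritten this way, the modified markup can presumably be handed to pdfkit as before, for example with pdfkit.from_string(str(soup), 'out.pdf'), so that the images are fetched from their absolute locations.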

Answer 1

This may be only half of the answer, but the code below can help you convert the URLs to absolute paths. This is how I did it.

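def parse_all_links(self, soup, baseurl):
    # baseurl: URL prefix prepended to relative links
    for link in soup.find_all('a'):
        href = link.get('href')
        if not href:
            continue
        if href.startswith('http'):
            # Already absolute (this also covers 'https')
            print(href)
        elif href == '#' or href == '/':
            # Placeholder link or site root: nothing to convert
            pass
        else:
            # Plain concatenation; assumes baseurl ends with '/'
            href = baseurl + href
            print(href)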
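
The baseurl + href concatenation in this answer works when baseurl ends with a slash and the link is a simple relative path. As a possible alternative, the sketch below uses urljoin for the same decision logic. The helper name resolve_href is hypothetical, and unlike the answer it resolves '/' to the site root instead of skipping it.

```python
from urllib.parse import urljoin

def resolve_href(baseurl, href):
    """Return an absolute URL for href, or None for placeholder links.

    resolve_href is a hypothetical helper, not part of the original answer.
    """
    # Skip empty links and pure in-page anchors such as '#' or '#top'
    if not href or href.startswith('#'):
        return None
    # urljoin leaves absolute URLs unchanged and resolves '../' segments
    return urljoin(baseurl, href)
```

Inside the loop over soup.find_all('a'), the result could be written back with link['href'] = resolved whenever it is not None.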

Python - Beautifulsoup to PDF with picture (relative paths)

html

pdf

path

beautifulsoup