How to get an image src URL from page source with XPath in Python

Question

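So, I'm working on a program to download some images from a website, and I need to get the "src" part of the img tag somehow. I was able to do this with selenium, but I had to adapt the code, and now I'm using BeautifulSoup4 and lxml. I currently have the full source of the page in a variable "mystr", and I would like to supply an XPath and find it within that variable. Is that possible? (Probably.) I'm posting this question because I can't seem to parse the variable with lxml and use its .xpath() function.

--More background on the problem-- I'm reading some data (reference values and URLs) from an Excel file, and I want to open each URL, download the product image, and rename it to the reference. I can already do this for pages with multiple images, but when the URL has only one image I want to use an XPath to download it, and I don't want to use selenium again.

Thanks in advance. I think this is the part of the code that matters for this question.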

import urllib.request

from bs4 import BeautifulSoup
from lxml import etree

# The rest is an excerpt from inside a loop over links (i is the loop index);
# links, erros and div_id are defined elsewhere in the program.
try:  # Fetch the HTML
    fp = urllib.request.urlopen(links[i])
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
except Exception as ex:  # Fetching the HTML failed
    print("Could not fetch the HTML from this url")
    erros.append(i)
    continue
try:  # Hand the HTML to Beautiful Soup 4
    soup = BeautifulSoup(mystr, "lxml")
    #print(mystr, file = open("teste.txt", "a"))
except Exception as ex:  # Beautiful Soup 4 failed
    print("Could not convert the HTML to bs4\n\n" + str(ex))
    erros.append(i)
    continue
try:  # Navigate to the DIV inside the fetched HTML
    main_div = soup.find_all("div", {"id": div_id})
    if len(main_div) == 0:
        parser = etree.HTMLParser()
        # This is the step that does not work: parsing the page source with lxml
        tree = etree.parse(mybytes, parser)
        #print(tree, file=open("tree.txt", "a"))
        #image = tree.xpath('//*[@id="image"]')
        image = tree.xpath("/html/body/div[1]/div/div/div/div[1]/div[1]/div[1]/a/img")
        print(image[0].tag)
        #input("--------------------------------------------------")
except Exception as ex:  # No DIV with the given id exists in the fetched HTML
    print("No DIV with the given id exists\n\n" + str(ex))
    erros.append(i)
    continue

Answer 1

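To get started with XPath, see http: wiki/XPath for more information on using XPath expressions. The expression //a/@href selects the href attribute of every link (a tag). For the src attribute of every image, that would be //img/@src.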
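As a minimal sketch of the expressions above, applied to page source that is already held in a variable: the markup below is made up for illustration and stands in for the asker's "mystr". It assumes lxml is installed, and uses etree.HTML(), which parses markup from memory (etree.parse() expects a file name or file-like object instead).

```python
from lxml import etree

# Hypothetical page source standing in for the asker's `mystr` variable.
mystr = """
<html><body>
  <div id="gallery">
    <a href="/products/42"><img id="image" src="/media/42.jpg" alt="Product"></a>
    <a href="/about">About</a>
  </div>
</body></html>
"""

# Parse the in-memory markup into an element tree.
tree = etree.HTML(mystr)

# Attribute selections return a list of strings, one per match.
hrefs = tree.xpath("//a/@href")
srcs = tree.xpath("//img/@src")

# The same idea restricted to the element with id="image".
by_id = tree.xpath('//*[@id="image"]/@src')

print(hrefs)
print(srcs)
print(by_id)
```

If the source is still raw bytes rather than a decoded string, etree.fromstring(mybytes, etree.HTMLParser()) should work the same way.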

Answer 2

A BeautifulSoup way:

img_src = soup.find("img")["src"]

An lxml etree way:

img_src = tree.xpath('//img')[0].attrib.get('src')
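Both one-liners raise an exception when the page has no img tag (find() returns None, and indexing an empty XPath result fails), so a slightly more defensive version may be useful. The sketch below is only an illustration: the helper names, the sample markup and page_url are all hypothetical, and urljoin is added on the assumption that src values may be relative to the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical inputs for illustration only.
page_url = "https://example.com/products/42"
mystr = '<html><body><div><a href="#"><img src="/media/42.jpg"></a></div></body></html>'


def first_src_bs4(html):
    """Return the src of the first <img>, or None if there is none."""
    img = BeautifulSoup(html, "lxml").find("img")
    return img.get("src") if img is not None else None


def first_src_lxml(html):
    """Return the src of the first <img>, or None if there is none."""
    matches = etree.HTML(html).xpath("//img/@src")
    return matches[0] if matches else None


src = first_src_lxml(mystr)

# Resolve a relative src against the page URL before downloading it.
absolute = urljoin(page_url, src) if src else None

print(first_src_bs4(mystr), src, absolute)
```

Returning None instead of raising lets the calling loop record the failing URL and move on, much as the question's code does with its error list.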

html

python

xpath

lxml

beautifulsoup