How can you extract information from links that are stored in a list?

Question

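I want to go through this list and get specific information (name, address, phone number, and e-mail address of the respective company) from the page behind each link: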

['https://allianz-entwicklung-klima.de/kompensationspartner/aera-group/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/atmosfair-ggmbh/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/bischoff-ditze-energy-gmbh-co-kg/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/climate-extender-gmbh/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/climatepartner-gmbh/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/die-klimamanufaktur-gmbh/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/die-ofenmacher-e-v/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/first-climate/',
 'https://allianz-entwicklung-klima.de/kompensationspartner/fokus-zukunft-gmbh-co-kg/']

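All of the information should end up in a single table. I tried a for loop, but it did not work for me: I only got the first link to work, not the others.

I would appreciate any help.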

Answer 1

Personally, I would use Selenium WebDriver for any web scraping. It lets you automate your browser from code: it can visit each link, select what you need, store the values, and then return them.
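As a rough sketch of that idea, the loop could look like the following. The helper names collect_page_text and run_with_chrome are made up for illustration, and the sketch assumes Selenium 4 with Chrome available on the machine. It only grabs the visible text of each page; picking out the individual fields would need selectors that match the real page markup, which is not shown in the question.

```python
def collect_page_text(driver, links):
    """Visit every link with the given WebDriver and keep the visible page text."""
    rows = []
    for link in links:
        driver.get(link)
        # "tag name" is the locator strategy that By.TAG_NAME stands for.
        body = driver.find_element("tag name", "body")
        rows.append({"url": link, "text": body.text})
    return rows


def run_with_chrome(links):
    """Hypothetical entry point: open Chrome, scrape all links, always close it."""
    # Imported here so collect_page_text can be used with any driver object.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        return collect_page_text(driver, links)
    finally:
        driver.quit()
```

Calling run_with_chrome(links) with the list from the question would return one dictionary per link, which is the important difference from handling only the first one.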

Answer 2

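You can use the Python libraries requests and BeautifulSoup to scrape these pages. I wrote the short snippet below; I have not had time to test it, but it should work. You still have to extract the required information with Beautiful Soup and store it in a list of dictionaries, for example: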

data = [{"name": "", "address": "", "number": "", "mail": ""}]

import requests
from bs4 import BeautifulSoup

links = ['https://allianz-entwicklung-klima.de/kompensationspartner/aera-group/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/atmosfair-ggmbh/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/bischoff-ditze-energy-gmbh-co-kg/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/climate-extender-gmbh/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/climatepartner-gmbh/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/die-klimamanufaktur-gmbh/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/die-ofenmacher-e-v/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/first-climate/',
        'https://allianz-entwicklung-klima.de/kompensationspartner/fokus-zukunft-gmbh-co-kg/']

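data = []  # one dictionary per link

for link in links:
    # Fetch the page; the timeout keeps one slow site from hanging the loop.
    page = requests.get(link, timeout=10)
    soup = BeautifulSoup(page.content, "html.parser")
    # Extract the fields from soup here, inside the loop, and
    # append them to data so that every link contributes one row.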
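To make the extraction step more concrete, here is a minimal sketch of how the loop body might be filled in. It rests on assumptions: that the company name sits in the first h1 element, and that the e-mail address and phone number appear as plain text on the page. Neither has been checked against the actual site, and the helper names extract_contact and scrape are invented for this example. The address is left out because it depends entirely on the page markup.

```python
import re

# Deliberately loose patterns; they are assumptions, not a full validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")


def extract_contact(text):
    """Return the first e-mail address and phone number found in plain text."""
    mail = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "mail": mail.group(0) if mail else "",
        "number": phone.group(0) if phone else "",
    }


def scrape(links):
    """Fetch every link and build one dictionary per page."""
    # Third-party imports stay local so extract_contact works on its own.
    import requests
    from bs4 import BeautifulSoup

    data = []
    for link in links:
        page = requests.get(link, timeout=10)
        page.raise_for_status()
        soup = BeautifulSoup(page.content, "html.parser")
        heading = soup.find("h1")  # assumed location of the company name
        row = {"name": heading.get_text(strip=True) if heading else "", "url": link}
        row.update(extract_contact(soup.get_text(" ", strip=True)))
        data.append(row)  # appended inside the loop: one row per link
    return data
```

The key point for the original problem is that the append happens inside the loop; if the extraction runs after the loop, only one page ends up being processed.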

To learn how to extract data with Beautiful Soup, I recommend reading: Beautiful Soup: Build a Web Scraper With Python
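Since the question asks for everything to end up in a table, one simple option is to write the list of dictionaries out as CSV with the standard library. This is only a sketch: the function name rows_to_csv is hypothetical, and the column names follow the example dictionary shown earlier in this answer.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=("name", "address", "number", "mail")):
    """Turn a list of dictionaries into CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not table columns;
    # restval="" fills columns that a row does not provide.
    writer = csv.DictWriter(
        buffer, fieldnames=list(fieldnames), extrasaction="ignore", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned text can be written to a file and opened in any spreadsheet program.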


python

list

permalinks