Iterate over a set of URLs and gather the output data in CSV format

Similar to this thread and task:

I have a question: how do I iterate over a set of 700 URLs to fetch the data for 700 digital hubs in CSV (or Excel) format?

See the page where the data lives, shown here:

https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool 

with a list of URLs like these:

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view

and so on.

Question: can we apply this to a similar task, treating the digital hubs as the data collection? I have already applied a scraper to a single site and it works, but how do I add CSV output to a scraper that iterates over the URLs? Can we write the output to CSV as well, applying the same technique?

I want to pair the scraped paragraphs with the headings recently scraped from the hubCards. At the moment I am scraping the hubCards one page at a time to work out an approach, but I want to scrape the headings of all 700 cards so that I can view the data together in a CSV file. I want to write the results in a proper format, probably a CSV file. Note: we have the following h2 headings;

Note: we have the following headings on each hubCard:

Title: (probably an h4 tag)
Contact: 
Description:
'Organization', 
'Evolutionary Stage', 
'Geographical Scope', 
'Funding', 
'Partners', 
'Technologies'
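In the target CSV, each of these headings would become a column. A minimal sketch with the stdlib `csv.DictWriter`, using the field names from the list above and a hypothetical card (the values are invented placeholders):

```python
import csv

# per-card fields from the list above become the CSV header
fields = ["Title", "Contact", "Description", "Organization",
          "Evolutionary Stage", "Geographical Scope", "Funding",
          "Partners", "Technologies"]

# hypothetical card; real values would come from the scraper
row = {"Title": "Demo hub", "Organization": "Demo org"}

with open("hubs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerow(row)  # fields missing from the dict become empty cells
```

Cards that lack a section (e.g. no "Funding" block) then simply produce an empty cell instead of shifting the columns.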

My single-page scraper looks like this:

from bs4 import BeautifulSoup
import requests

page_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view'
page_response = requests.get(page_link, verify=False, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
for tag in page_content.find_all('h4')[1:]:  # skip the first h4, which is not a card heading
    texth4 = tag.text.strip()
    textContent.append(texth4)
    # keep only the paragraphs whose nearest preceding h4 is this heading
    for item in tag.find_next_siblings('p'):
        if texth4 in item.find_previous_siblings('h4')[0].text.strip():
            textContent.append(item.text.strip())

print(textContent)
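To get this single-card result into a file, the flat list can be written as one CSV row with the stdlib `csv` module. A sketch; the sample list here is a stand-in for the `textContent` scraped above:

```python
import csv

# hypothetical stand-in for the textContent list scraped above
textContent = ["Description", "Some paragraph text", "Organization", "SurgiTAIX AG"]

# newline="" is required so the csv module controls line endings itself
with open("hubcard.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(textContent)
```

This writes one row per card; quoting of commas and quotes inside the scraped text is handled by the module.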

Console output:

Description', 'Link to national or regional initiatives for digitising industry', 'Market and Services', 'Service Examples', 'Leveraging the holding system "EndoTAIX" from scientific development to ready-to -market', 'For one of SurgiTAIX AG\'s products, the holding system "EndoTAIX" for surgical instrument fixation, the SurgiTAIX AG cooperated very closely with the RWTH University\'s Helmholtz institute. The services provided comprised the complete first phase of scientific development. Besides, after the first concepts of the holding system took shape, a prototype was successfully build in the scope of a feasibility study. In the role regarding the self-conception as a transfer service provider offering services itself, the SurgiTAIX AG refined the technology to market level and successfully performed all the steps necessary within the process to the approval and certification of the product. Afterwards, the product was delivered to another vendor with SurgiTAIX AG carrying out the production process as an OEM.', 'Development of a self-adapting robotic rehabilitation system', 'Based on the expertise of different partners of the hub, DIERS International GmbH (SME) was enabled to develop a self-adapting robotic rehabilitation system that allows patients after stroke to relearn motion patterns autonomously. The particular challenge of this cooperation was to adjust the robot to the individual and actual needs of the patient at any particular time of the exercise. Therefore, different sensors have been utilized to detect the actual movement performance of the patient. Feature extraction algorithms have been developed to identify the actual needs of the individual patient and intelligent predicting control algorithms enable the robot to independently adapt the movement task to the needs of the patient. 
These challenges could be solved only by the services provided by different partners of the hub which include the transfer of the newly developed technologies, access to patient data, acquisition of knowledge and demands from healthcare personal and coordinating the application for public funding.', 'Establishment of a robotic couch lab and test facility for radiotherapy', 'With the help of services provided by different partners of the hub, the robotic integrator SME BEC GmbH was given the opportunity to enhance their robotic patient positioning device "ExaMove" to allow for compensation of lung tumor movements during free breathing. The provided services solved the need to establish a test facility within the intended environment (the radiotherapy department) and provided the transfer of necessary innovative technologies such as new sensors and intelligent automatic control algorithms. Furthermore, the provided services included the coordination of the consortium, identifying, preparing and coordinating the application for public funding, provision of access to the hospital’s infrastructure and the acquisition of knowledge and demands from healthcare personal.', 'Organization', 'Evolutionary Stage', 'Geographical Scope', 'Funding', 'Partners', 'Technologies']

So far, so good. The goal now is a clean solution: how do I iterate over a set of 700 URLs (in other words, 700 hubCards) to get the data for the 700 digital hubs in CSV (or Excel) format?

You can select the tags with class="hubCardTitle", select the element that follows each one, and pair them up with zip():
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]


out = []
for url in urls:
    print(f"Getting {url}")
    soup = BeautifulSoup(requests.get(url, verify=False, timeout=5).content, "html.parser")

    # the card title is the first <h2> on the page
    d = {"URL": url, "Title": soup.h2.text}

    # each section heading sits in a div.hubCardTitle; its content is the <div> right after it
    titles = soup.select("div.hubCardTitle")
    content = soup.select("div.hubCardTitle + div")

    for t, c in zip(titles, content):
        t = t.get_text(strip=True)
        c = c.get_text(strip=True, separator="\n")
        d[t] = c

    out.append(d)

df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)

This creates data.csv (screenshot from LibreOffice):
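For the Excel format asked about above, the same DataFrame can be written with `DataFrame.to_excel`. A sketch with a stand-in DataFrame; note this needs an Excel writer engine such as openpyxl installed:

```python
import pandas as pd

# stand-in for the `df` built from the scraped cards above
df = pd.DataFrame([{"URL": "https://example.org", "Title": "Demo hub"}])

# requires an Excel engine (e.g. `pip install openpyxl`)
df.to_excel("data.xlsx", index=False)
```

Columns that do not occur on every card come out as empty cells in the sheet, exactly as in the CSV.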