Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

Question

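Despite my ignorance, I have got this code almost working. Please help me get it over the finish line!

Problem 1: Input:

I have a long list of URLs (1000+) to read, and they are in a single column of a .csv file. I would rather read them from that file than paste them into the code, as I do below.

Problem 2: Output:

The source pages actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints, and saves all 3, but it does not when I use the DataFrame version below (see below: it only saves 2).

Problem 3: Output:

I want the output (two files) to have the URLs in column 0, followed by the drivers (or challenges) in the following columns. But what I have written here (possibly the 'drop') makes them not only lose a row but also shift across by 2 columns.

At the end I show the input as well as the current and desired output. Sorry for the long question. I would be very grateful for any help!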

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

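The input at each URL looks like this. They are just lists:

Market drivers

Growing investment in fabs
Miniaturization of electronic products
Increasing demand for IoT devices

Market challenges

Rapid technological changes in semiconductor industry
Volatility in semiconductor industry
Impact of technology chasmTable Impact of drivers and challenges

I want the output for the drivers to be: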

0	1	2	3
http/.../Global-Induction-Hobs-30196623/	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices

But I get:

0	1	2	3	4	5	6
http/.../Global-Induction-Hobs-30196623/	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/			Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/					Miniaturization of electronic products	Increasing demand for IoT devices

Answer 1

Store your data in a list of dicts and create a DataFrame from it. Split the lists of drivers / challenges into single columns and concat them to the final DataFrame.

Example
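
To see the idea in isolation before the full scraping code, here is a minimal offline sketch. The records, the `example.com` URLs, and the item names are made up for illustration and are not real scrape output. It also avoids the two things in the question's code that appear to cause the symptoms: `drop(0, axis=0)` removes the first list item, and concatenating per-URL frames row-wise before transposing pushes each URL's items into its own set of columns.

```python
import pandas as pd

# Made-up records standing in for scraped results (hypothetical data)
records = [
    {'url': 'https://example.com/a', 'type': 'driver', 'list': ['d1', 'd2', 'd3']},
    {'url': 'https://example.com/a', 'type': 'challenges', 'list': ['c1', 'c2', 'c3']},
    {'url': 'https://example.com/b', 'type': 'driver', 'list': ['d4', 'd5']},
]

df = pd.DataFrame(records)

# Expand each list into its own columns 0, 1, 2, ...
# Rows with shorter lists are padded with missing values.
items = pd.DataFrame(df['list'].tolist())

# Put url / type in front of the expanded item columns
wide = pd.concat([df[['url', 'type']], items], axis=1)
print(wide)
```

Because every record becomes exactly one row, the URL stays in the first column and the items always start in column 0, however many items a page has.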

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],axis = 1)#.to_csv(sep='|')

Output

url	type	0	1	2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	driver	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	challenges	High cost limiting the adoption in the mass segment	Health hazards related to induction hobs	Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	driver	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	challenges	Threat from open-source software	High implementation and maintenance cost	Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	driver	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	challenges	Rapid technological changes in semiconductor industry	Volatility in semiconductor industry	Impact of technology chasm
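
The example above still hard-codes the URLs and returns one combined table, so Problem 1 (reading the URLs from a .csv) and the two-file output are left open. The following is only a sketch of one way to cover them. It assumes the .csv has no header row and keeps the URLs in its first column; `read_urls` and `split_by_type` are hypothetical helper names, and the in-memory CSV and `example.com` records stand in for the real file and the real scrape.

```python
import io
import pandas as pd

def read_urls(csv_source, column=0):
    """Return the URLs from one column of a headerless CSV as a list."""
    urls = pd.read_csv(csv_source, header=None, usecols=[column])[column]
    return urls.dropna().str.strip().tolist()

def split_by_type(data):
    """Build one wide DataFrame per 'type', with the URL in the first column."""
    df = pd.DataFrame(data)
    items = pd.DataFrame(df['list'].tolist())
    wide = pd.concat([df[['url', 'type']], items], axis=1)
    return {
        kind: group.drop(columns='type').reset_index(drop=True)
        for kind, group in wide.groupby('type')
    }

# In-memory stand-in for the real file; pass a file path instead in practice
sample_csv = io.StringIO("https://example.com/a\nhttps://example.com/b\n")
urls = read_urls(sample_csv)

# Made-up records; in the real script these come from the scraping loop
data = []
for url in urls:
    data.append({'url': url, 'type': 'driver', 'list': ['d1', 'd2', 'd3']})
    data.append({'url': url, 'type': 'challenges', 'list': ['c1', 'c2', 'c3']})

tables = split_by_type(data)

# to_csv() without a path returns the CSV text, which is handy for checking
print(tables['driver'].to_csv(index=False))
```

To produce the two files from the question, `tables['driver'].to_csv('detail-dr.csv', index=False)` and `tables['challenges'].to_csv('detail-ch.csv', index=False)` could replace the final `print`. Writing once after the loop, rather than on every iteration, also avoids rewriting the files 1000+ times.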
