Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

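Despite my inexperience, I have this code almost working. I would appreciate help getting it over the line.

I have a long list of URLs (1000+) to read, stored in a single column of a .csv. I would rather read them from that file than paste them into the code as shown below.

The source actually has 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints and saves all 3, but it does not when I use the dataframe approach below (see below - it only saves 2).

I want the output (two files) to have the URLs in column 0, followed by the drivers (or challenges) in the columns after it. But what I have written here (probably the 'drop') not only moves them down a row, it also shifts them across by 2 columns.

At the end I show the input, plus the current and desired output. Sorry for the long question. Any help would be much appreciated!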

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The input at each URL looks like this. They are just lists:

Market drivers

Market challenges

I want the output for the drivers to be:

0 1 2 3
http/.../Global-Induction-Hobs-30196623/ Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices

But I get:

0 1 2 3 4 5 6
http/.../Global-Induction-Hobs-30196623/ Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Miniaturization of electronic products Increasing demand for IoT devices

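Store your data in a list of dicts and create a dataframe from it. Then split each list of drivers / challenges into individual columns and concat the result onto the final dataframe.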
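As background, the missing first item and the shifted columns in the question can be reproduced without any scraping. The sketch below is only an illustration: the dictionary `scraped` and its keys `url_a` / `url_b` are made-up stand-ins for the real URLs and list items. It suggests that `drop(0, axis=0)` discards the first item, and that `pd.concat` on frames with different column names pads with NaN, which is what staggers the values after transposing.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists; no network access needed.
scraped = {
    "url_a": ["a1", "a2", "a3"],
    "url_b": ["b1", "b2", "b3"],
}

frames = []
for url, items in scraped.items():
    df = pd.DataFrame(items, columns=[url])
    # drop(0, axis=0) removes the row labelled 0, i.e. the first scraped item.
    frames.append(df.drop(0, axis=0))

# Each frame has a different column name, so concat aligns on columns,
# fills the gaps with NaN, and keeps the repeated row labels 1, 2, 1, 2.
staggered = pd.concat(frames).T
print(staggered)

# Building one row per URL directly from the lists keeps all three items
# and puts them in the same columns for every URL.
tidy = pd.DataFrame(list(scraped.values()), index=list(scraped.keys()))
print(tidy)
```

Here `staggered` has four item columns for two URLs, with each URL's values sitting in a different pair of columns, while `tidy` has three aligned columns.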

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],axis = 1)#.to_csv(sep='|')

Output

url type 0 1 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ driver Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ challenges High cost limiting the adoption in the mass segment Health hazards related to induction hobs Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ driver Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ challenges Threat from open-source software High implementation and maintenance cost Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ driver Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ challenges Rapid technological changes in semiconductor industry Volatility in semiconductor industry Impact of technology chasm
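
To connect this back to the two parts of the question not covered above (reading the URLs from a .csv and writing one file per type), here is a minimal sketch. It assumes the CSV is a single column with no header row. `io.StringIO` stands in for the real file, the `example.com` URLs and the `d1` / `c1` items are placeholders, and `to_wide` is a hypothetical helper, not part of pandas.

```python
import io
import pandas as pd

# Assumption: a single-column CSV without a header row.
# io.StringIO stands in for a real file path here.
csv_source = io.StringIO("https://example.com/a/\nhttps://example.com/b/\n")
urls = pd.read_csv(csv_source, header=None)[0].tolist()

# Placeholder records in the same shape the scraping loop appends to `data`.
data = []
for url in urls:
    data.append({"url": url, "type": "driver", "list": ["d1", "d2", "d3"]})
    data.append({"url": url, "type": "challenges", "list": ["c1", "c2", "c3"]})


def to_wide(records, kind):
    """Return one row per URL for the given type: URL first, then its items."""
    rows = [[r["url"], *r["list"]] for r in records if r["type"] == kind]
    return pd.DataFrame(rows)


drivers = to_wide(data, "driver")
challenges = to_wide(data, "challenges")

# Printed here as CSV text; passing a file name such as 'detail-dr.csv'
# as the first argument to to_csv would write a file instead.
print(drivers.to_csv(header=False, index=False))
print(challenges.to_csv(header=False, index=False))
```

In a real run the placeholder loop would be replaced by the scraping loop from the example, and the two frames would go to `detail-dr.csv` and `detail-ch.csv` as in the question.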