Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas
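Despite my inexperience, I've got this code almost working. Please help me get it over the line!
- Problem 1: Input:
I have a long list of URLs (1000+) to read, stored in a single column of a .csv file. I would rather read them from that file than paste them into the code as below.
- Problem 2: Output:
Each source page actually has 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints, and saves all 3, but not when I use this dataframe version (see below: it only saves 2).
- Problem 3: Output:
I want the output (both files) to have the URLs in column 0, followed by the drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') not only drops a row, it also shifts each result over by 2 columns.
At the end I show the input along with the current and desired outputs. Sorry for the long question. Any help would be greatly appreciated!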
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
    dataframes = []
    dataframes2 = []

    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        toc = soup.find("div", id="toc")

        def get_drivers():
            data = []
            for x in toc.select('li:-soup-contains-own("Market drivers") li'):
                data.append(x.get_text(strip=True))
            df = pd.DataFrame(data, columns=[url])
            dataframes.append(pd.DataFrame(df).drop(0, axis=0))
            df2 = pd.concat(dataframes)
            tdata = df2.T
            tdata.to_csv(f'detail-dr.csv', header=True)

        get_drivers()

        def get_challenges():
            data = []
            for y in toc.select('li:-soup-contains-own("Market challenges") li'):
                data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
            df = pd.DataFrame(data, columns=[url])
            dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
            df2 = pd.concat(dataframes2)
            tdata = df2.T
            tdata.to_csv(f'detail-ch.csv', header=True)

        get_challenges()
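The input at each URL looks like this. They are just lists:
Market drivers
- Growing investment in fabs
- Miniaturization of electronic products
- Increasing demand for IoT devices
Market challenges
- Rapid technological changes in semiconductor industry
- Volatility in semiconductor industry
- Impact of technology chasmTable Impact of drivers and challenges
I want the output for the drivers to be: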
0 | 1 | 2 | 3 |
---|---|---|---|
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
But I get:
0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | | |
http/.../Global-Human-Capital-Management-30196628/ | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | | | | |
http/.../Global-Probe-Card-30196643/ | Miniaturization of electronic products | Increasing demand for IoT devices | | | | |
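Store your data in a list of dicts and create a dataframe from it. Then split the list of drivers / challenges into individual columns and concat those columns to the final dataframe. This also avoids the `drop(0, axis=0)` call in the question, which removes the row labelled 0, that is, the first driver or challenge of every URL.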
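The splitting step can be tried offline before running the scraper. The following is a minimal sketch with made-up placeholder records (the `example.com` URLs and the item texts are not from the real pages); it only illustrates how a column of lists is expanded into separate columns and joined back to the `url` and `type` columns.

```python
import pandas as pd

# Placeholder records shaped like the ones the scraper collects.
# The URLs and item texts are invented for illustration only.
records = [
    {"url": "https://example.com/a", "type": "driver", "list": ["d1", "d2", "d3"]},
    {"url": "https://example.com/a", "type": "challenges", "list": ["c1", "c2"]},
]

df = pd.DataFrame(records)

# Expand each list into its own columns (0, 1, 2, ...).
# Shorter lists are padded with missing values.
items = pd.DataFrame(df["list"].tolist(), index=df.index)

# Place url/type and the expanded item columns side by side.
wide = pd.concat([df[["url", "type"]], items], axis=1)
print(wide)
```

Because every record becomes exactly one row, nothing is dropped and the items always start in the first item column, regardless of how many URLs were processed before.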
Example
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

    # one dict per URL and type; each dict becomes one row later
    data = []

    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        toc = soup.find("div", id="toc")

        def get_drivers():
            data.append({
                'url':url,
                'type':'driver',
                'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
            })

        get_drivers()

        def get_challenges():
            # skip the nested "Table ..." item and strip its text from the last challenge
            data.append({
                'url':url,
                'type':'challenges',
                'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
            })

        get_challenges()

    # expand the 'list' column into separate columns and join them to url/type
    pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],axis = 1)#.to_csv(sep='|')
Output
url | type | 0 | 1 | 2 |
---|---|---|---|---|
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm |
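The example above keeps the URLs in the code and produces a single table, so Problem 1 and the two output files still need handling. The following is a minimal sketch, assuming the URLs sit in the first column of a CSV file without a header row; the file name `urls.csv` and the helper names `load_urls` and `save_by_type` are hypothetical, while `detail-dr.csv` and `detail-ch.csv` are the output names used in the question.

```python
import pandas as pd


def load_urls(csv_path):
    """Read URLs from the first column of a header-less CSV file."""
    col = pd.read_csv(csv_path, header=None, usecols=[0]).iloc[:, 0]
    # Drop empty cells and surrounding whitespace before returning a list.
    return col.dropna().astype(str).str.strip().tolist()


def save_by_type(df, type_to_file):
    """Write one CSV per 'type' value, with the URL as the first column."""
    for type_name, out_path in type_to_file.items():
        subset = df[df["type"] == type_name].drop(columns="type")
        subset.to_csv(out_path, index=False)


# Usage sketch (file names are assumptions, adjust as needed):
# urls = load_urls("urls.csv")
# ... run the scraping loop above to build the final dataframe `result` ...
# save_by_type(result, {"driver": "detail-dr.csv",
#                       "challenges": "detail-ch.csv"})
```

If the CSV does have a header row, dropping `header=None` and selecting the column by name would be the more natural choice.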