Scraping multiple (market index) sites with BeautifulSoup
I am developing the following code to scrape financial data from a specific website.
import requests
import pandas as pd

urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter']


def main(urls):
    with requests.Session() as req:
        goal = []
        for url in urls:
            r = req.get(url)
            df = pd.read_html(
                r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
            goal.append(df)
        new = pd.concat(goal)
        print(new)


main(urls)
I am getting the information I need:
       2017      2018      2019  30-Sep-2019  31-Dec-2019  31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.48B)      (3.54B)      (3.38B)
0  (11.85B)   (12.7B)  (13.81B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.51B)      (3.89B)      (3.88B)
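I need to scrape at least 20 companies (from the same source). The URLs are identical except for one element, which I'll call index:
'https://www.marketwatch.com/investing/stock/' + index + '/financials/cash-flow'
Is there a way to add a variable named Index and iterate over it, something like: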
import requests
import pandas as pd
Index = 'MSFT, AAPL'
and
urls = ['https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow',
'https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow/quarter']
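A simple solution: use a nested loop and string formatting to build the required URLs. Note that Index = 'MSFT, AAPL' is a single string, so concatenating it would put both symbols into one URL; keep the symbols in a tuple or list instead.
For example: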
import requests
import pandas as pd

# Ticker symbols to scrape; add more entries as needed.
indexes = 'aapl', 'MSFT', 'F'


def main(indexes):
    # URL templates: {index} is filled in with each ticker symbol.
    urls = ['https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow',
            'https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow/quarter']
    goal = []
    with requests.Session() as req:
        for index in indexes:
            for url in urls:
                url = url.format(index=index)
                print('Processing url', url)
                r = req.get(url)
                # Take the first table matching the text, then row 0, columns 3 to 5.
                df = pd.read_html(
                    r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
                goal.append(df)
    new = pd.concat(goal)
    print(new)


main(indexes)
Prints:
Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter
Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter
Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow/quarter
       2017      2018      2019  30-Sep-2019  31-Dec-2019  31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.48B)      (3.54B)      (3.38B)
0  (11.85B)   (12.7B)  (13.81B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.51B)      (3.89B)      (3.88B)
0   (2.58B)   (2.91B)   (2.39B)          NaN          NaN          NaN
0       NaN       NaN       NaN       (598M)       (595M)       (596M)
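If you would rather keep the symbols in one comma-separated string, as in the question's Index = 'MSFT, AAPL', something along these lines might work. It is only a sketch of the URL-building step and makes no network requests. The names BASE, parse_tickers and build_urls are made up for illustration; the URL pattern is the one from the question.

```python
from itertools import product

# URL pattern taken from the question; {suffix} is '' for the annual page
# and '/quarter' for the quarterly page.
BASE = "https://www.marketwatch.com/investing/stock/{ticker}/financials/cash-flow{suffix}"


def parse_tickers(raw):
    """Split a comma-separated string such as 'MSFT, AAPL' into clean symbols."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_urls(tickers, suffixes=("", "/quarter")):
    """Return (ticker, url) pairs, one per ticker and page type."""
    return [
        (ticker, BASE.format(ticker=ticker, suffix=suffix))
        for ticker, suffix in product(tickers, suffixes)
    ]


if __name__ == "__main__":
    for ticker, url in build_urls(parse_tickers("MSFT, AAPL")):
        print(ticker, url)
```

Keeping the ticker next to each URL also helps with the output: every row in the printed frame has index 0, so you cannot tell which company a row belongs to. One option, if you collect a label per frame, is to pass that list as keys to pd.concat so each row is tagged with its ticker.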