Webscraping Dynamic Website to Pull Recent News Article URLs
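I'm trying to pull investment news articles from a dynamic website using Python. I've worked through several tutorials that work for static sites, but I'm having trouble pulling the URLs of the individual articles. The code I'm using is below: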
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.institutionalinvestor.com/search?'
'term=&' # eventually, the term would include the words I am actively searching for
'filters=%7B"dates":%5B"last%20week"%5D%7D') # filter to the last week, this would eventually be for the last 24 hours only
r.html.absolute_links
This gets me the links found on the page, returned as a set:
{'https://www.institutionalinvestor.com/Login', 'https://www.institutionalinvestor.com/display-advertising', 'http://www.ttivanguard.com/', 'https://www.riaintel.com/', 'http://interactive.institutionalinvestor.com/executive-IR-research-em/about-586KX-2742AB.html', 'https://twitter.com/iimag', 'https://myaccount.institutionalinvestor.com/Orders/SelectPackage.html', 'https://www.institutionalinvestor.com/', 'https://www.institutionalinvestor.com/Corner-Office', 'https://www.institutionalinvestor.com/Management', 'http://iimemberships.com/', 'http://www.iiconferences.com/', 'https://www.institutionalinvestor.com/Register', 'https://www.institutionalinvestor.com/cookies', 'https://www.institutionalinvestor.com/Careers', 'https://www.institutionalinvestor.com/Custom-Research', 'https://www.institutionalinvestor.com/Portfolio', 'https://www.euromoneyplc.com/modern-slavery-act-transparency-statement', 'https://www.institutionalinvestor.com/research', 'https://www.institutionalinvestor.com/Masthead', 'https://www.institutionalinvestor.com/about-thought-leadership', 'https://www.institutionalinvestor.com/Investors', 'https://www.institutionalinvestor.com/Premium', 'https://www.institutionalinvestor.com/about-us', 'https://www.institutionalinvestor.com/thought-leadership', 'https://www.institutionalinvestor.com/PrivacyPolicy', 'https://www.institutionalinvestor.com/sponsored', 'https://www.institutionalinvestor.com/Video', 'https://www.institutionalinvestor.com/How-to-Pitch-Institutional-Investor', 'https://www.institutionalinvestor.com/FAQs', 'https://www.institutionalinvestor.com/Research-FAQs', 'https://www.institutionalinvestor.com/Reprints', 'https://www.institutionalinvestor.com/TermsConditions', 'https://www.linkedin.com/company/164389', 'https://www.facebook.com/iimag', 'https://www.institutionalinvestor.com/Customer-Service', 'https://www.institutionalinvestor.com/Culture', 'https://www.institutionalinvestor.com/awards', 
'https://www.institutionalinvestor.com/Research-Insight', 'http://www.sovereignwealthcenter.com/'}
But I can't find the links to the articles themselves. When I inspect the page source, this is what I see:
<div class="search-results" role="listbox">
<article class="search-result" ng-repeat="article in serverData.hits.results">
<div class="search-result-text-ghost"></div>
<h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
</h2>
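Being new to HTML, the h2 part at the end leads me to believe the site is dynamic, and this is where I'm stuck. Any help would be appreciated. My ideal output is the article title, the source (in this case "Institutional Investor"), an article preview (the first few lines or so), and the URL, all placed in a dataframe that can be sent to me each morning, saving the time I would otherwise spend pulling the news manually. I've put the rest of the project together, except for pulling news from sites like Institutional Investor that aren't covered by the API I'm using.
I'm open to any and all new approaches if necessary or recommended. Thanks in advance!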
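The markup quoted above hints at why `absolute_links` comes back without article URLs: the anchor carries an `ng-href` template rather than a real `href`, and the placeholders are only filled in once the page's JavaScript runs in a browser. A minimal sketch using only the standard library makes this visible on the quoted snippet (the `AnchorCollector` class name is made up for illustration):

```python
from html.parser import HTMLParser

# The unrendered markup quoted in the question, as served before any JavaScript runs.
RAW_HTML = """
<div class="search-results" role="listbox">
<article class="search-result" ng-repeat="article in serverData.hits.results">
<div class="search-result-text-ghost"></div>
<h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
</h2>
"""

class AnchorCollector(HTMLParser):
    """Collects the attributes of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))

collector = AnchorCollector()
collector.feed(RAW_HTML)

# Links a static scraper can follow: anchors with a real href attribute.
static_links = [a["href"] for a in collector.anchors if "href" in a]
# Angular templates that still need a browser to be resolved.
templated_links = [a["ng-href"] for a in collector.anchors if "ng-href" in a]

print(static_links)
print(templated_links)
```

The first list is empty and the second holds only the unresolved template, which is consistent with the site being rendered client-side.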
Try using Selenium
A simple working example
You may want to refine a few things, e.g. the base URL, a dataframe instead of print, ...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = "https://www.institutionalinvestor.com/search"
# executable_path is the Selenium 3 style; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
sleep(3)  # crude wait so the JavaScript-rendered results can appear
soup = BeautifulSoup(driver.page_source, "lxml")
# Each search hit is an <article> directly inside div.search-results
for article in soup.select('div.search-results > article'):
    title = article.find('h2').get_text()
    link = article.find('a')['href']
    print(title + ': https://www.institutionalinvestor.com' + link)
driver.close()
Output
Who’s on Third?
: https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
First the Cyberattack Hits. Then the Insider Trading.
: https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
Hedge Funds Featured Prominently in 2020 SPAC Boom
: https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
The Stocks That Drove Glenview’s Major Comeback
: https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
Bill Ackman’s Billion-Dollar Year
: https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
Ex-Verger Interns Make NFL, ‘Bachelor’ Debuts
: https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter
: https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
Gold's 2020 Ride Explained
: https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
The ARK Invest Takeover Battle Is Over
: https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
Investors Quickly Saw Big Gains From These SPACs
: https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
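The answer's note about the base URL and about using a dataframe instead of `print` could be handled along these lines. This is only a sketch: `to_record` and `BASE_URL` are hypothetical names, and it assumes the scraped `href` values are site-relative paths, as the output above suggests. Normalising the title also removes the trailing line break that pushes each colon onto its own line in that output.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.institutionalinvestor.com"  # assumed base for relative hrefs

def to_record(raw_title, href, base_url=BASE_URL):
    """Turn one scraped (title, href) pair into a clean row for a dataframe."""
    return {
        # Collapse runs of whitespace and drop leading/trailing newlines.
        "title": " ".join(raw_title.split()),
        # urljoin resolves relative paths and leaves absolute URLs unchanged.
        "url": urljoin(base_url, href),
    }

record = to_record("Who’s on Third?\n", "/article/b1pqxvgpm3dwjb/Who-s-on-Third")
print(record)
```

Inside the loop, the `print` call could then give way to appending `to_record(article.find('h2').get_text(), article.find('a')['href'])` to a list, which `pandas.DataFrame` accepts directly.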
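Yes, it's dynamic. You can use Selenium to let the page render first and then pull the HTML as you normally would for a static site. Alternatively, the data is all there in their API (I think even the full articles are in there, but I only pulled what you asked for):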
import requests
import json
import pandas as pd
api = 'https://search.euromoneyapi.com/api/Search'
headers = {'content-type': 'application/json',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
# Request body: first 10 results, sorted by date in descending order
data = {"site": "amg_ii", "suggester": True, "from": 0, "size": 10, "sort": "dates", "sort_order": "desc"}
jsonData = requests.post(api, headers=headers, data=json.dumps(data)).json()
rows = []
articles = jsonData['hits']['results']
for article in articles:
    title = article['snippet']['title'][0]
    source = 'https://www.institutionalinvestor.com/'
    # Some results have no description; fall back to an empty preview
    try:
        preview = article['snippet']['description'][0]
    except (KeyError, IndexError, TypeError):
        preview = ''
    # Article URL = site prefix + last segment of the id + URL-friendly title
    url = 'https://www.institutionalinvestor.com/article/' + article['id'].split('/')[-1] + '/' + article['fields']['url_title'][0]
    row = {'title': title,
           'source': source,
           'preview': preview,
           'url': url}
    rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df.to_string())
title source preview url
0 Who’s on Third? https://www.institutionalinvestor.com/ Third-party claims filing service providers require due diligence for shareholder litigation outside the U.S https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
1 First the Cyberattack Hits. Then the Insider Trading. https://www.institutionalinvestor.com/ Researchers share their striking evidence of pre-disclosure spikes in options trading. https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
2 Hedge Funds Featured Prominently in 2020 SPAC Boom https://www.institutionalinvestor.com/ Nearly 13 percent of the blank check companies that filed plans to go public in 2020 were sponsored by hedge fund firms or individuals formerly associated with the industry. https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
3 The Stocks That Drove Glenview’s Major Comeback https://www.institutionalinvestor.com/ Larry Robbins’ hedge fund finished 2020 solidly positive thanks to huge gains in the final two months of the year. https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
4 Bill Ackman’s Billion-Dollar Year https://www.institutionalinvestor.com/ A big short and a big SPAC fueled hefty gains for Pershing Square in 2020. https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
5 Ex-Verger Interns Make NFL, ‘Bachelor’ Debuts https://www.institutionalinvestor.com/ Verger Capital Management CIO Jim Dunn shared the inside story on former interns John Wolford and Matt James. https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
6 David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter https://www.institutionalinvestor.com/ The manager turned in a strong fourth quarter by sticking with his biggest positions. https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
7 Gold's 2020 Ride Explained https://www.institutionalinvestor.com/ https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
8 The ARK Invest Takeover Battle Is Over https://www.institutionalinvestor.com/ A new deal has “extinguished” Resolute’s option to acquire an additional stake in the ETF firm. https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
9 Investors Quickly Saw Big Gains From These SPACs https://www.institutionalinvestor.com/ At least two blank-check companies surged on recent merger announcements. https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
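The request body's `from` and `size` fields look like an offset and a page size. That reading is an assumption based on their names, not on any documentation for this endpoint. Under that assumption, a sketch of building bodies for successive pages, and of reading each hit with `dict.get` fallbacks rather than exception handling, might look like this. `search_payload` and `extract_row` are hypothetical helpers, and the sample hit is hand-written to mirror the fields used in the code above.

```python
def search_payload(offset=0, size=10):
    """Build the JSON body for one page of results (offset semantics assumed)."""
    return {"site": "amg_ii", "suggester": True, "from": offset,
            "size": size, "sort": "dates", "sort_order": "desc"}

def extract_row(article, base="https://www.institutionalinvestor.com"):
    """Flatten one search hit into title / source / preview / url."""
    snippet = article.get("snippet", {})
    # Each snippet field is a list; fall back to [""] when it is missing or empty.
    title = (snippet.get("title") or [""])[0]
    preview = (snippet.get("description") or [""])[0]
    article_id = article["id"].split("/")[-1]
    url_title = article["fields"]["url_title"][0]
    return {"title": title, "source": base + "/", "preview": preview,
            "url": f"{base}/article/{article_id}/{url_title}"}

# Hand-written sample shaped like the hits consumed above (not real API output;
# the "some/prefix/" part of the id is invented for illustration).
sample_hit = {
    "id": "some/prefix/b1psmn58mppsyj",
    "snippet": {"title": ["Gold's 2020 Ride Explained"]},  # no description
    "fields": {"url_title": ["gold39s-2020-ride-explained"]},
}

payloads = [search_payload(offset) for offset in range(0, 30, 10)]
row = extract_row(sample_hit)
print(row["url"], repr(row["preview"]))
```

Each body could then be sent with `requests.post(api, headers=headers, json=payload)`, and the rows from every page collected into one list before building the dataframe.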