Recursive Web Scraping Pagination
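I am trying to scrape some real estate articles from the following website:
I managed to get the links I need, but I am struggling with pagination on the web page. I am trying to scrape every link under each category ('building your team', 'capital rising', etc.). Some of these category pages have pagination and some do not. I tried the following code, but it only gives me the links from 2 pages.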
from requests_html import HTMLSession


def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
            'Tags': link.find('a', first=True).text,
            'Links': link.find('a', first=True).attrs['href']
        })
    return links


def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    articles = []
    try:
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    except:
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))
To print the title of every article under each tag, across every page of that tag, you can use this example:
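import requests
from bs4 import BeautifulSoup

url = "https://lifebridgecapital.com/podcast/"

# Collect the URL of every tag from the tag cloud on the podcast page.
soup = BeautifulSoup(requests.get(url).content, "html.parser")
tag_links = [a["href"] for a in soup.select(".tagcloud a")]

for link in tag_links:
    # Keep following the "next" link until the tag has no more pages.
    while True:
        print(link)
        print("-" * 80)

        soup = BeautifulSoup(requests.get(link).content, "html.parser")

        # Each article title on the page is a link inside an <h3>.
        for title in soup.select("h3 a"):
            print(title.text)

        print()

        # Tags without pagination (and last pages) have no "a.next" element.
        next_link = soup.select_one("a.next")
        if not next_link:
            break

        link = next_link["href"]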
Prints:
...
https://lifebridgecapital.com/tag/multifamily/
--------------------------------------------------------------------------------
WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
WS889: How To Avoid A $1,000,000 Mistake with Hugh Odom
WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
WS269: The Impacts of Your First Deal with Kyle Mitchell
WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
WS259: Capital Raising For Newbies with Bryan Taylor
https://lifebridgecapital.com/tag/multifamily/page/2/
--------------------------------------------------------------------------------
WS257: Why Ground Up Development is the Best Investment with Sam Bates
WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
WS249: Managing Real Estate Paperwork Successfully with Krista Testani
WS245: Multifamily Syndication with Venkat Avasarala
WS244: Passive Investing In Real Estate with Kay Kay Singh
WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
WS213: Data Analytics In Real Estate with Raj Tekchandani
WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra
https://lifebridgecapital.com/tag/multifamily/page/3/
--------------------------------------------------------------------------------
WS172: Real Estate Syndication with Kyle Jones
...
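Since the question asks for the article links rather than the titles, and its `parse_tag_links` follows the `rel="next"` link only once, here is a minimal recursive sketch of the same idea. It reuses the `h3 a` and `a.next` selectors from the example above; the names `fetch_html`, `collect_article_links`, `fetch` and `seen`, as well as the 30-second timeout, are my own assumptions and not part of the original code.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def collect_article_links(url, fetch=fetch_html, seen=None):
    """Return the 'h3 a' hrefs from `url` and from every page that follows it."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against a "next" link that points back to a visited page
        return []
    seen.add(url)

    soup = BeautifulSoup(fetch(url), "html.parser")
    # Resolve each href against the current page in case it is relative.
    links = [urljoin(url, a["href"]) for a in soup.select("h3 a[href]")]

    next_link = soup.select_one("a.next[href]")
    if next_link is None:  # base case: last (or only) page of this tag
        return links

    # Recursive case: append whatever the following pages yield.
    next_url = urljoin(url, next_link["href"])
    return links + collect_article_links(next_url, fetch, seen)
```

Calling `collect_article_links("https://lifebridgecapital.com/tag/multifamily/")` should then return one flat list covering all pages of that tag, and a tag without pagination simply hits the base case on its first page. The `fetch` parameter is only there so the function can be tried against saved HTML without touching the network. For very long archives the `while True` loop above is the safer choice, because Python limits recursion depth.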