Recursive Web Scraping Pagination

Question

I am trying to scrape some real estate articles from the website used in the code below.

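I managed to get the links I need, but I am struggling with pagination on the web page. I am trying to scrape every link under each category ('building your team', 'capital rising', etc.). Some of these category pages have pagination and some do not. I tried the following code, but it only gives me the links from page 2.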

from requests_html import HTMLSession


def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
            'Tags': link.find('a', first=True).text,
            'Links': link.find('a', first=True).attrs['href']
        })

    return links

def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    articles = []
    try:
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    except:
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))

Answer 1

To print the title of every article under each tag, across every page of that tag, you can use this example:

import requests
from bs4 import BeautifulSoup


url = "https://lifebridgecapital.com/podcast/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
tag_links = [a["href"] for a in soup.select(".tagcloud a")]

for link in tag_links:
    while True:
        print(link)
        print("-" * 80)

        soup = BeautifulSoup(requests.get(link).content, "html.parser")

        for title in soup.select("h3 a"):
            print(title.text)

        print()

        next_link = soup.select_one("a.next")
        if not next_link:
            break

        link = next_link["href"]

Prints:

...

https://lifebridgecapital.com/tag/multifamily/
--------------------------------------------------------------------------------
WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
WS889: How To Avoid A ,000,000 Mistake with Hugh Odom
WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
WS269: The Impacts of Your First Deal with Kyle Mitchell
WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
WS259: Capital Raising For Newbies with Bryan Taylor

https://lifebridgecapital.com/tag/multifamily/page/2/
--------------------------------------------------------------------------------
WS257: Why Ground Up Development is the Best Investment with Sam Bates
WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
WS249: Managing Real Estate Paperwork Successfully with Krista Testani
WS245: Multifamily Syndication with Venkat Avasarala
WS244: Passive Investing In Real Estate with Kay Kay Singh
WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
WS213: Data Analytics In Real Estate with Raj Tekchandani
WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra

https://lifebridgecapital.com/tag/multifamily/page/3/
--------------------------------------------------------------------------------
WS172: Real Estate Syndication with Kyle Jones

...
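
The question asks for the article links rather than the titles, so here is a rough sketch of the same "follow the next link until there is none" loop that returns hrefs instead of printing. It is not the answer's code: it uses only the standard library in place of requests and BeautifulSoup. The names `TagPageParser`, `fetch_html`, `collect_article_links` and `max_pages` are made up for illustration. It assumes the markup matches the selectors used in the answer above (article links inside `h3`, and an `a` element with class `next` on paginated pages).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TagPageParser(HTMLParser):
    """Collect hrefs of <a> inside <h3>, plus the href of <a class="next">."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0
        self.article_links = []
        self.next_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self.h3_depth += 1
        elif tag == "a" and attrs.get("href"):
            classes = (attrs.get("class") or "").split()
            if self.h3_depth > 0:
                # Equivalent of the "h3 a" selector (assumed markup).
                self.article_links.append(attrs["href"])
            elif "next" in classes and self.next_link is None:
                # Equivalent of the "a.next" selector (assumed markup).
                self.next_link = attrs["href"]

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_depth > 0:
            self.h3_depth -= 1


def fetch_html(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_article_links(start_url, fetch=fetch_html, max_pages=100):
    """Follow "next" links from start_url and return every article href."""
    links, seen, url = [], set(), start_url
    # Stop when there is no next link, on a repeated URL, or at max_pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = TagPageParser()
        parser.feed(fetch(url))
        links.extend(urljoin(url, href) for href in parser.article_links)
        url = urljoin(url, parser.next_link) if parser.next_link else None
    return links
```

Calling `collect_article_links('https://lifebridgecapital.com/tag/multifamily/')` should then return one list covering every page of that tag, and a tag without pagination simply yields the links from its single page. The `seen` set and `max_pages` cap are only there to guard against a "next" link that loops back on itself.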
