How to use Playwright and BeautifulSoup on a web page that has pagination?

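I'm new to web scraping. I want to scrape data (the comments and their respective dates) from this web page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The thread is paginated. This is what I did: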

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json
AllEntries = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False,slow_mo=50)
    noofforumpagesvodafone = 1000
    currentpage = 1
    page = browser.new_page()
    page.goto('https://search.donanimhaber.com/?q=vodafone&p='+ str(currentpage) + '&token=-1&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all',timeout = 0)
    html = page.inner_html("div.results")
    soup = BeautifulSoup(html, 'html.parser')
    xx = [x.get('href') for x in soup.find_all('a')]

    xxi = 0
    time = []
    while(xxi<1):
        if(xx[xxi][0] == "/"):
            entry = []
            # page.goto('https://search.donanimhaber.com' + str(xx[xxi]),timeout = 0)
            page.goto("https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938")

            html = page.inner_html("div.kl-icerik")
            soup = BeautifulSoup(html, 'html.parser')

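            # Each reply sits in its own div.ki-cevapicerigi container.
            for table in soup.find_all('div', {'class': 'ki-cevapicerigi'}):
                # Print the info label of the reply (the date).
                for t in table.find_all('span', {'class': 'mButon info'}):
                    print(t.text)

                # Print the comment text, which is held in <td> or <p> elements.
                for links in table.find_all('span', {'class': 'msg'}):
                    for link in links.find_all('td'):
                        print(link.text)
                    for linko in links.find_all('p'):
                        print(linko.text)

        # Move on to the next search result; without this the loop never ends.
        xxi += 1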

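This code only works for the first page. It prints all the comments and dates there, but not those from pages 2, 3, 4, ..., which appear when we scroll to the bottom.

How can I do this? Thanks.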

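In your particular case, each page has its own link. It is your base link followed by a hyphen (-) and the page number.

You can see this behavior when you click on the second page and compare your base link with the current link: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2

(Note the -2 at the end.)

One approach is to change the URL in a for loop, iterating up to 24, and scrape each of those pages separately.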
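The loop described above could look roughly like the sketch below. The helper names `page_urls` and `scrape_topic` are made up for illustration, the selectors are taken from the question's code, and it is assumed that page 1 is the plain base link while every later page appends `-n`. Nothing here has been checked against the live site.

```python
from typing import Dict, List

BASE_URL = (
    "https://forum.donanimhaber.com/"
    "turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"
)


def page_urls(base_url: str, last_page: int) -> List[str]:
    """Return one URL per page: the base link for page 1, base + '-n' after that."""
    return [
        base_url if n == 1 else f"{base_url}-{n}"
        for n in range(1, last_page + 1)
    ]


def scrape_topic(base_url: str, last_page: int) -> List[Dict[str, object]]:
    """Visit every page of the topic and collect dates and comment texts."""
    # Imported here so the URL helper above can be used without a browser installed.
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    replies = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for url in page_urls(base_url, last_page):
            page.goto(url)
            # Same container and classes as in the question's code.
            soup = BeautifulSoup(page.inner_html("div.kl-icerik"), "html.parser")
            for reply in soup.find_all("div", {"class": "ki-cevapicerigi"}):
                dates = [
                    t.get_text(strip=True)
                    for t in reply.find_all("span", {"class": "mButon info"})
                ]
                texts = [
                    m.get_text(" ", strip=True)
                    for m in reply.find_all("span", {"class": "msg"})
                ]
                replies.append({"url": url, "dates": dates, "texts": texts})
        browser.close()
    return replies
```

Calling `scrape_topic(BASE_URL, 24)` would walk all 24 pages. The page count is hard-coded here only because the answer mentions it; reading it from the pagination widget would be more robust.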

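You can do this more simply. You don't need to open a browser; a simple POST request to the site's API is enough.
The API request URI is: https://search.donanimhaber.com/api/search/messages/?q=vodafone&p=1&order=date&in=all&type=both&scope=all&daterange=all
You can change some of the parameters:
q = the term you are searching for
p = the page number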
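As a small illustration of these parameters, the request URI can be assembled with the standard library. The function name `search_url` is hypothetical, and the remaining parameters are simply copied from the URI above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://search.donanimhaber.com/api/search/messages/"


def search_url(query: str, page: int) -> str:
    """Build the search API URI for a given search term and page number."""
    params = {
        "q": query,          # the term you are searching for
        "p": page,           # the page number
        "order": "date",     # the rest is copied unchanged from the URI above
        "in": "all",
        "type": "both",
        "scope": "all",
        "daterange": "all",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Looping `page` from 1 upward gives one URI per page of results.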
You can also call the API with Playwright 1.20.0:
https://playwright.dev/python/docs/api/class-apiresponse
It returns a JSON response like this:

    {
        "forumId": 600,
        "id": 152365149,
        "topicId": 102657976,
        "newsId": 0,
        "dateCreated": "2022-03-25T22:51:42.407304+03:00",
        "dateString": "3 dakika önce",
        "body": "Yenilenen 9 GB var o da 44 den 58 olmuş. Son durumda hangi tarifler var fiyatları neler güncel bir tablo olsa içinden seçsek güzel olur",
        "forumTitle": "Cep Telefonu ve Operatörler",
        "subject": "<span class='highlight'>Vodafone</span>dan Gizli Tarifeler! (İlle de <span class='highlight'>Vodafone</span> kullanacağım diyenlere.)",
        "imageUrl": null,
        "subResults": [
            {
                "forumId": 0,
                "id": 152364084,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T21:03:18.1226493+03:00",
                "dateString": "1 saat önce",
                "body": "Benim 26 liralık saçma güzel 2+ tarifem 37 lira olmuş.Daha düşük fiyata bir şeyler var mı ? 1 gb int...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152362724,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T18:17:14.8390711+03:00",
                "dateString": "4 saat önce",
                "body": "Allah aşkına daha yeni geçtik bi dur be..",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152362447,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T17:35:51.9730334+03:00",
                "dateString": "5 saat önce",
                "body": "Olay 5 gb da 41 lira olacakmış yuh artık .",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152360755,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T13:49:20.9403583+03:00",
                "dateString": "9 saat önce",
                "body": "demin mesaj geldi olay15gb 65 tl olarak güncellenecektir. son 3 ayda rahat 25 tl zam geldi tarifeme ...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152360644,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T13:31:47.8799036+03:00",
                "dateString": "9 saat önce",
                "body": "Kazançlı 7 GB tarifesinin fiyatı 31.03'ten sonra 55 TL olacakmış. Bu nasıl zam, yazıklar olsun.",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152359613,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T11:18:20.4112563+03:00",
                "dateString": "11 saat önce",
                "body": "Kolay gelsin.\nTaahhütümün bitmesine 20 gün kala dijital asistanın bana önerdiği tarifeye geçiş yaptı...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            }
        ]
    },
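A response shaped like the excerpt above can be flattened into (date, comment) pairs with a few lines of Python. This is only a sketch: `iter_messages` and `fetch_json` are illustrative names, and it is assumed that the results arrive as a list of objects like the one shown, which the excerpt does not confirm. The request goes through Playwright's `APIRequestContext` and uses POST as described above; if the endpoint expects GET instead, `api.get(url)` is the call to swap in.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_messages(results: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, str]]:
    """Yield (dateCreated, body) for each result and each of its subResults."""
    for item in results:
        yield item["dateCreated"], item["body"]
        # subResults is null (None) on nested entries, so fall back to an empty list.
        for sub in item.get("subResults") or []:
            yield sub["dateCreated"], sub["body"]


def fetch_json(url: str) -> Any:
    """Send the request without opening a browser page and return the parsed JSON."""
    # Imported here so iter_messages can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        api = p.request.new_context()
        response = api.post(url)  # POST as described in the answer; not verified
        if not response.ok:
            raise RuntimeError(f"request failed with status {response.status}")
        data = response.json()
        api.dispose()
        return data
```

Once the list of results has been located inside the value returned by `fetch_json`, `for date, body in iter_messages(results): print(date, body)` prints every comment with its timestamp.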