How to use Playwright and BeautifulSoup on a web page that has pagination?
I'm new to web scraping. I want to scrape data (the comments and their dates) from this page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The page is paginated.
This is what I have so far:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json

AllEntries = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    noofforumpagesvodafone = 1000
    currentpage = 1
    page = browser.new_page()
    # Open the search results page (timeout=0 disables the navigation timeout)
    page.goto('https://search.donanimhaber.com/?q=vodafone&p=' + str(currentpage) + '&token=-1&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all', timeout=0)
    html = page.inner_html("div.results")
    soup = BeautifulSoup(html, 'html.parser')
    xx = [x.get('href') for x in soup.find_all('a')]
    xxi = 0
    time = []
    while xxi < 1:
        if xx[xxi][0] == "/":
            entry = []
            # page.goto('https://search.donanimhaber.com' + str(xx[xxi]), timeout=0)
            page.goto("https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938")
            html = page.inner_html("div.kl-icerik")
            soup = BeautifulSoup(html, 'html.parser')
            for table in soup.find_all('div', {'class': 'ki-cevapicerigi'}):
                # Date of each reply
                for t in table.find_all('span', {'class': 'mButon info'}):
                    print(t.text)
                # Text of each reply
                for links in table.find_all('span', {'class': 'msg'}):
                    for link in links.find_all('td'):
                        print(link.text)
                    for linko in links.find_all('p'):
                        print(linko.text)
        xxi += 1  # advance the index, otherwise the loop never ends
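This code only works for the first page. It prints all the comments and their dates as expected, but nothing from pages 2, 3, 4 and so on, which show up when we scroll down to the button.
How can I do this? Thanks.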
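In your particular case, every page has its own link: your base link followed by a hyphen (-) and the page number.
You can see this by clicking on the second page and comparing your base link with the current one:
https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2
(note the -2 at the end)
One approach is to change the URL in a for loop, iterating up to page 24, and scrape each of those pages individually.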
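The loop described above could be sketched roughly as follows. This is only an illustration: `page_urls` and `scrape_topic` are hypothetical helper names, the selectors are the ones from the question, and the sketch assumes that page 1 is the base link without a suffix and that the topic has 24 pages.

```python
def page_urls(base_url: str, last_page: int) -> list[str]:
    """Page 1 is the base URL itself; page n (n >= 2) is the base URL plus '-n'."""
    return [base_url if n == 1 else f"{base_url}-{n}" for n in range(1, last_page + 1)]


def scrape_topic(base_url: str, last_page: int) -> None:
    """Visit every page of the topic and print the reply dates and texts."""
    # Imported here so that page_urls() can be used without these packages installed.
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for url in page_urls(base_url, last_page):
            page.goto(url)
            soup = BeautifulSoup(page.inner_html("div.kl-icerik"), "html.parser")
            for reply in soup.find_all("div", {"class": "ki-cevapicerigi"}):
                for date in reply.find_all("span", {"class": "mButon info"}):
                    print(date.text)
                for msg in reply.find_all("span", {"class": "msg"}):
                    print(msg.get_text(" ", strip=True))
        browser.close()
```

Calling `scrape_topic(base_link, 24)` with the topic's base link would then walk through all 24 pages in turn.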
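You can do this more simply. You don't need to open a browser at all; a simple POST request to the site's API is enough.
The API request URI is:
https://search.donanimhaber.com/api/search/messages/?q=vodafone&p=1&order=date&in=all&type=both&scope=all&daterange=all
You can change some of the parameters:
q = the term you are searching for
p = the page number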
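To make the two parameters concrete, here is a small sketch that rebuilds the request URI for any search term and page number. `build_search_url` is a hypothetical helper name; the remaining parameters are copied unchanged from the URI above.

```python
from urllib.parse import urlencode

API_URL = "https://search.donanimhaber.com/api/search/messages/"


def build_search_url(query: str, page: int) -> str:
    """Return the API URI for the given search term (q) and page number (p)."""
    params = {
        "q": query,
        "p": page,
        # The values below are taken as-is from the URI in the answer.
        "order": "date",
        "in": "all",
        "type": "both",
        "scope": "all",
        "daterange": "all",
    }
    return f"{API_URL}?{urlencode(params)}"
```

Incrementing `page` in a loop yields one URI per page of search results.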
You can also call the API with Playwright 1.20.0:
https://playwright.dev/python/docs/api/class-apiresponse
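A minimal sketch of such a call with Playwright's request context, without opening a browser page. `search_params` and `fetch_search_page` are hypothetical helper names. The answer describes a POST request, so the sketch uses `post`; whether the endpoint also accepts (or requires) GET is not confirmed here, in which case `ctx.get` with the same arguments would be the change to try.

```python
API_URL = "https://search.donanimhaber.com/api/search/messages/"


def search_params(query: str, page: int) -> dict:
    """Query parameters from the answer; only q and p are meant to vary."""
    return {
        "q": query,
        "p": page,
        "order": "date",
        "in": "all",
        "type": "both",
        "scope": "all",
        "daterange": "all",
    }


def fetch_search_page(query: str, page: int):
    """Request one page of search results and return the decoded JSON."""
    # Imported here so that search_params() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        ctx = p.request.new_context()
        try:
            # Assumption: POST, as stated in the answer.
            response = ctx.post(API_URL, params=search_params(query, page))
            if not response.ok:
                raise RuntimeError(f"HTTP {response.status} for page {page}")
            return response.json()
        finally:
            ctx.dispose()
```

`fetch_search_page("vodafone", 1)` would request the first page of results for the term "vodafone".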
The API returns a JSON response like this:
{
  "forumId": 600,
  "id": 152365149,
  "topicId": 102657976,
  "newsId": 0,
  "dateCreated": "2022-03-25T22:51:42.407304+03:00",
  "dateString": "3 dakika önce",
  "body": "Yenilenen 9 GB var o da 44 den 58 olmuş. Son durumda hangi tarifler var fiyatları neler güncel bir tablo olsa içinden seçsek güzel olur",
  "forumTitle": "Cep Telefonu ve Operatörler",
  "subject": "<span class='highlight'>Vodafone</span>dan Gizli Tarifeler! (İlle de <span class='highlight'>Vodafone</span> kullanacağım diyenlere.)",
  "imageUrl": null,
  "subResults": [
    {
      "forumId": 0,
      "id": 152364084,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T21:03:18.1226493+03:00",
      "dateString": "1 saat önce",
      "body": "Benim 26 liralık saçma güzel 2+ tarifem 37 lira olmuş.Daha düşük fiyata bir şeyler var mı ? 1 gb int...",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    },
    {
      "forumId": 0,
      "id": 152362724,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T18:17:14.8390711+03:00",
      "dateString": "4 saat önce",
      "body": "Allah aşkına daha yeni geçtik bi dur be..",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    },
    {
      "forumId": 0,
      "id": 152362447,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T17:35:51.9730334+03:00",
      "dateString": "5 saat önce",
      "body": "Olay 5 gb da 41 lira olacakmış yuh artık .",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    },
    {
      "forumId": 0,
      "id": 152360755,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T13:49:20.9403583+03:00",
      "dateString": "9 saat önce",
      "body": "demin mesaj geldi olay15gb 65 tl olarak güncellenecektir. son 3 ayda rahat 25 tl zam geldi tarifeme ...",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    },
    {
      "forumId": 0,
      "id": 152360644,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T13:31:47.8799036+03:00",
      "dateString": "9 saat önce",
      "body": "Kazançlı 7 GB tarifesinin fiyatı 31.03'ten sonra 55 TL olacakmış. Bu nasıl zam, yazıklar olsun.",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    },
    {
      "forumId": 0,
      "id": 152359613,
      "topicId": 0,
      "newsId": 0,
      "dateCreated": "2022-03-25T11:18:20.4112563+03:00",
      "dateString": "11 saat önce",
      "body": "Kolay gelsin.\nTaahhütümün bitmesine 20 gün kala dijital asistanın bana önerdiği tarifeye geçiş yaptı...",
      "forumTitle": null,
      "subject": null,
      "imageUrl": null,
      "subResults": null
    }
  ]
},
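Since the question asks for comments and their dates, one possible way to pull those two fields out of such a response is sketched below. `iter_messages` is a hypothetical helper; it assumes the decoded response has been reduced to a list of result objects shaped like the one above (the exact top-level structure of the API response is not shown in the answer), and it keeps `dateCreated` as the raw string.

```python
import json


def iter_messages(results):
    """Yield (dateCreated, body) for every result and each of its subResults."""
    for item in results:
        yield item["dateCreated"], item["body"]
        # subResults is null (None) on nested entries, so fall back to an empty list.
        for sub in item.get("subResults") or []:
            yield sub["dateCreated"], sub["body"]


# A trimmed copy of the response above, keeping only the fields used here.
sample = json.loads("""
[
  {
    "dateCreated": "2022-03-25T22:51:42.407304+03:00",
    "body": "Yenilenen 9 GB var o da 44 den 58 olmuş. Son durumda hangi tarifler var fiyatları neler güncel bir tablo olsa içinden seçsek güzel olur",
    "subResults": [
      {
        "dateCreated": "2022-03-25T18:17:14.8390711+03:00",
        "body": "Allah aşkına daha yeni geçtik bi dur be..",
        "subResults": null
      }
    ]
  }
]
""")

for date_created, body in iter_messages(sample):
    print(date_created, body)
```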