Data Scraping for Pagination of the Products to Get All Product Details

I want to scrape all product data for the 'Cushion cover' category at URL = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'. I have worked out that the data sits inside a script tag, but how do I get the data from all pages? I need the URLs of every product across all pages. The per-page data is also exposed through an API: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get each page's data, but how do I collect the data from all the different pages? Please advise.
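For context, this is the kind of direct loop against that JSON endpoint I had in mind; note that the build ID in the path ('B60DhzfamQWEpEl9Q8ajE') comes from my current session and will likely expire, so treat this as a sketch only:

import requests

# Sketch only: the _next/data build ID is session-specific and may stop
# working at any time; the page range here is just a small test.
api = ('https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/'
       'home-and-kitchen/home-decor/slipcovers/cushion-cover.json'
       '?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc')

for page in range(1, 4):
    resp = requests.get(api.format(page=page), timeout=30)
    print(page, resp.status_code)

And this is my HTML-based attempt so far: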

import requests
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)

# The page data is embedded as JSON in the __NEXT_DATA__ script tag
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]

I used the re lib instead. In other words, using a regular expression to pull the embedded JSON works much better for pages that render their data with JavaScript:
import requests
import json
import re

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

url = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(url, headers=headers, timeout=30)

# Grab the embedded JSON payload directly, without parsing the HTML tree
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
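From the parsed jsonpage you can then walk down to the product list. Based on the structure used in the answer below, the hits sit under props → pageProps → props → catalog (these key names are what the live page currently returns and may change):

# Path observed in the live __NEXT_DATA__ payload; may change if the site updates
hits = jsonpage['props']['pageProps']['props']['catalog']['hits']
for item in hits:
    print('https://www.noon.com/' + item['url'])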

You are almost there. You can use a for loop with the range function to move through the next pages, i.e. paginate and pull every page. Since we know the total page count is 192, this is a robust way to handle the pagination. To grab all product URLs (or any other data item) from every page, you can follow the example below.

Script:

import requests
import pandas as pd
import json
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'

data = []
for page in range(0, 192):  # the category has 192 result pages in total
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)

    # Each page embeds its catalog data as JSON in the __NEXT_DATA__ script tag
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])

    # Collect the product URLs from this page's catalog hits
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)

df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # uncomment to save the data locally
print(df)

Output:

                     URL
0     https://www.noon.com/graphic-geometric-pattern...
1     https://www.noon.com/classic-nordic-decorative...
2     https://www.noon.com/embroidered-iconic-medusa...
3     https://www.noon.com/geometric-marble-texture-...
4     https://www.noon.com/traditional-damask-motif-...
...                                                 ...
9594  https://www.noon.com/geometric-printed-cushion...
9595  https://www.noon.com/chinese-style-art-printed...
9596  https://www.noon.com/chinese-style-art-printed...
9597  https://www.noon.com/chinese-style-art-printed...
9598  https://www.noon.com/chinese-style-art-printed...

[9599 rows x 1 columns]
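If you would rather not hardcode the 192-page total, one option is to keep requesting pages until a page comes back with no hits. This is a sketch that assumes an out-of-range page returns an empty hits list (worth verifying against the site before relying on it), and it reuses one session plus a short delay between requests:

import json
import time

import requests
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = ('https://www.noon.com/uae-en/home-and-kitchen/home-decor/'
           'slipcovers/cushion-cover/?limit=50&page={page}'
           '&sort[by]=popularity&sort[dir]=desc')

data = []
page = 1
with requests.Session() as session:  # reuse one connection across requests
    while True:
        resp = session.get(produrl.format(page=page), headers=headers, timeout=30)
        tree = html.fromstring(resp.text)
        payload = json.loads(tree.xpath('//script[@id="__NEXT_DATA__"]/text()')[0])
        hits = payload['props']['pageProps']['props']['catalog']['hits']
        if not hits:      # assumption: past the last page the hits list is empty
            break
        data.extend('https://www.noon.com/' + item['url'] for item in hits)
        page += 1
        time.sleep(1)     # small delay to stay polite to the server

print(len(data))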