Data scraping with pagination to get all product details
I want to scrape all the product data for the 'Cushion cover' category at URL = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
I worked out that the data sits inside a script tag, but how do I get the data from all the pages? I need the URLs of every product on every page, and the per-page data is also exposed through an API: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get each page's data, but how do we collect the data from all the different pages?
Please suggest how to approach this.
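For reference, fetching a single page from that endpoint directly could look like the sketch below. Note that the build ID segment in the path (B60DhzfamQWEpEl9Q8ajE) appears to be tied to the site's current deployment and will stop working when the site is redeployed, and I have not verified the key layout of the response:

import requests

headers = {'accept-language': 'en-US,en;q=0.9'}
# NOTE: the build ID below (B60DhzfamQWEpEl9Q8ajE) changes between deployments
api = ('https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE'
       '/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json'
       '?limit=50&page={page}&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc'
       '&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover')
resp = requests.get(api.format(page=2), headers=headers, timeout=30)
payload = resp.json()  # the endpoint returns JSON directly, no HTML parsing needed
print(list(payload))   # inspect the top-level keys; the inner layout is not verified here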
import requests
import pandas as pd
import json
import csv
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)  # should show <Response [200]>

# the product data is embedded in the __NEXT_DATA__ script tag
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = json.loads(partjson[0])
I used the re lib instead. In other words, using a regular expression to pull the embedded JSON out of the page, which works much better for scraping pages rendered with JavaScript:
import requests
import pandas as pd
import json
import csv
from lxml import html
import re

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)

# grab the JSON payload embedded in the page's <script type="application/json"> tag
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
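One caveat: taking jsonpage[0] assumes the __NEXT_DATA__ blob is the first application/json script on the page. If the page carries several such scripts, anchoring the pattern on the script's id (which the xpath in the question already does) is safer. A minimal sketch; the attribute order in the pattern is an assumption, so match it to the markup you actually see:

jsonpage = re.findall(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    prodresp.text,
    re.S,  # let .*? cross newlines in case the JSON is pretty-printed
)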
You are close to the goal. You can paginate with a for loop and the range function to pull every page: we know the total page count is 192, which is why iterating over the page numbers is a robust way to paginate. So, to get all the product URLs (or any other data item) from all the pages, you can follow the example below.
Script:
import requests
import pandas as pd
import json
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'
data = []
for page in range(1, 193):  # pages are numbered 1..192
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    # the catalog data for each page is embedded in its __NEXT_DATA__ script tag
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)

df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # to save the data on your system
print(df)
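Since this fires 192 requests in a tight loop, the site may start throttling or returning error pages partway through. A hedged variant of the same loop with basic error handling and a polite delay (the 1-second pause is an arbitrary choice, not something the site documents):

import time

for page in range(1, 193):
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodresp.raise_for_status()  # fail fast instead of parsing an error page
    prodResphtml = html.fromstring(prodresp.text)
    partjson = json.loads(prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')[0])
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        data.append('https://www.noon.com/' + item['url'])
    time.sleep(1)  # be polite to the server; tune as needed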
Output:
URL
0 https://www.noon.com/graphic-geometric-pattern...
1 https://www.noon.com/classic-nordic-decorative...
2 https://www.noon.com/embroidered-iconic-medusa...
3 https://www.noon.com/geometric-marble-texture-...
4 https://www.noon.com/traditional-damask-motif-...
... ...
9594 https://www.noon.com/geometric-printed-cushion...
9595 https://www.noon.com/chinese-style-art-printed...
9596 https://www.noon.com/chinese-style-art-printed...
9597 https://www.noon.com/chinese-style-art-printed...
9598 https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
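The tail of the output suggests some listings may repeat across pages (the truncated URLs in rows 9595-9598 look identical, though the display cutoff could be hiding differences), so you may want to drop duplicates before saving:

df = df.drop_duplicates().reset_index(drop=True)
df.to_csv('product.csv', index=False)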