Missing items when scraping javascript rendered page using scrapy and splash
I am trying to scrape the following site for basic real-estate listing information:
https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y
Part of the site's content is loaded dynamically from a back-end API as you scroll down the page with JavaScript. To work around this I tried rendering the JavaScript with Scrapy and Splash. The problem I am running into is that not all listings are returned; only the first 8 come back. I think the issue is that the page is never scrolled down, so it never gets populated and the divs I need are not rendered. I then tried adding some Lua code (which I have no experience with) to scroll the page down, hoping it would populate, but it did not work. My spider is below:
import scrapy
from scrapy.shell import inspect_response
import pandas as pd
import functools
import time
import requests
from lxml.html import fromstring
import math
from scrapy_splash import SplashRequest
import scrapy_splash


class pfspider(scrapy.Spider):
    name = 'property_finder_spider'
    start_urls = ["https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y"]

    script1 = """function main(splash)
        local num_scrolls = 10
        local scroll_delay = 1.0
        local scroll_to = splash:jsfunc("window.scrollTo")
        local get_body_height = splash:jsfunc(
            "function() {return document.body.scrollHeight;}"
        )
        assert(splash:go(splash.args.url))
        splash:wait(splash.args.wait)
        for _ = 1, num_scrolls do
            scroll_to(0, get_body_height())
            splash:wait(scroll_delay)
        end
        return splash:html()
    end"""

    def start_requests(self):
        for urll in self.start_urls:
            # yield scrapy_splash.SplashRequest(url=urll, callback=self.parse, endpoint='execute', args={'wait': 2, 'lua_source': script1})
            yield scrapy_splash.SplashRequest(url=urll, endpoint='render.html', callback=self.parse)

    def parse(self, response):
        inspect_response(response, self)
        containers = response.xpath('//div[@class="column--primary"]/div[@class="card-list__item"]')
        Listing_names_pf = containers[0].xpath('//h2[@class="card__title card__title-link"]/text()').extract()
        Currency_pf = ['AED'] * len(Listing_names_pf)
        Prices_pf = containers[0].xpath('//span[@class="card__price-value"]/text()').extract()
        type_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--property-type"]/text()').extract()
        Bedrooms_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--bedrooms"]/text()').extract()
        Bathrooms_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--bathrooms"]/text()').extract()
        SQF_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--area"]/text()').extract()
        Location_pf = containers[0].xpath('//span[@class="card__location-text"]/text()').extract()
        Links_pf = containers[0].xpath('//div[@class="card-list__item"]/a/@href').extract()
        Links_pf_full = []
        for link in Links_pf:
            Links_pf_full.append('https://www.propertyfinder.ae/' + link)
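For completeness, here is a minimal, untested sketch of how the commented-out 'execute' call could be wired up so the Lua scroll script actually runs. Note that script1 is a class attribute, so inside start_requests it has to be referenced as self.script1:

    def start_requests(self):
        for urll in self.start_urls:
            yield scrapy_splash.SplashRequest(
                url=urll,
                callback=self.parse,
                endpoint='execute',  # 'render.html' does not run lua_source
                args={'wait': 2, 'lua_source': self.script1},  # class attribute, hence self.
            )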
Another thing I noticed is that when the page is rendered through Splash, the HTML output contains a script called Tealium that does hold the listing data for all items, but that data is not under the divs on the page.
Any help or suggestions would be appreciated.
I am not familiar with Scrapy, but this can be done with Requests alone. Just look through the F12 -> XHR tab to find the following url.
To make it clearer, I broke the parameters out into a list of tuples and recombined them with the base url. The include parameter can be "trimmed" so that it contains only the data you want to retrieve; by default it contains everything. You can iterate over page[number], but be aware that you may get blocked if the number of req/s is too high (a small pagination sketch follows the sample output below).
import requests as rq
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"}
url = "https://www.propertyfinder.ae/en/api/search?"
params = [
    ("filter[category_id]", "2"),
    ("filter[furnished]", "0"),
    ("filter[locations_ids][]", "50"),
    ("filter[price_type]", "y"),
    ("include", "properties,properties.property_type,properties.property_images,properties.location_tree,properties.agent,properties.agent.languages,properties.broker,smart_ads,smart_ads.agent,smart_ads.broker,smart_ads.property_type,smart_ads.property_images,smart_ads.location_tree,direct_from_developer,direct_from_developer.property_type,direct_from_developer.property_images,direct_from_developer.location_tree,direct_from_developer.agent,direct_from_developer.broker,cts,cts.agent,cts.broker,cts.property_type,cts.property_images,cts.location_tree,similar_properties,similar_properties.agent,similar_properties.broker,similar_properties.property_type,similar_properties.property_images,similar_properties.location_tree,agent_smart_ads,agent_smart_ads.broker,agent_smart_ads.languages,agent_properties_smart_ads,agent_properties_smart_ads.agent,agent_properties_smart_ads.broker,agent_properties_smart_ads.location_tree,agent_properties_smart_ads.property_type,agent_properties_smart_ads.property_images"),
    ("page[limit]", "25"),
    ("page[number]", "4"),
    ("sort", "nd")
]
resp = rq.get(url, params=params, headers=headers).json()
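As an aside, a minimal, untested sketch of what a trimmed include might look like if only the property records themselves are needed (the values are taken from the full include string above; the same parsing as below should apply):

params_small = [p for p in params if p[0] != "include"]
params_small.append(("include", "properties,properties.property_type,properties.location_tree"))
resp_small = rq.get(url, params=params_small, headers=headers).json()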
Then you search resp for the data you are interested in:
resultat = []
for el in resp["included"]:
    if el["type"] == "property":
        data = {
            "name": el["attributes"]["name"],
            "default_price": el["attributes"]["default_price"],
            "bathroom_value": el["attributes"]["bathroom_value"],
            "bedroom_value": el["attributes"]["bedroom_value"],
            "coordinates": el["attributes"]["coordinates"],
        }
        resultat.append(data)
The result contains:
[{'name': '1Bed Apartment | Available | Large Terrace',
'default_price': 92000,
'bathroom_value': 2,
'bedroom_value': 1,
'coordinates': {'lat': 25.08333, 'lon': 55.144753}},
{'name': 'Furnished |Full sea view | All bills included',
'default_price': 179000,
'bathroom_value': 3,
'bedroom_value': 2,
'coordinates': {'lat': 25.083121, 'lon': 55.141064}},
........
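And here is a small sketch of iterating over page[number] as mentioned above, with a pause between requests to keep the req/s low. The page range is an arbitrary assumption; derive the real total from the response or pick it conservatively:

import time

all_results = []
for page in range(1, 5):  # assumed page range, adjust to the real page count
    page_params = [p for p in params if p[0] != "page[number]"]
    page_params.append(("page[number]", str(page)))
    page_resp = rq.get(url, params=page_params, headers=headers).json()
    for el in page_resp["included"]:
        if el["type"] == "property":
            all_results.append(el["attributes"]["name"])
    time.sleep(1)  # slow down to reduce the risk of being blocked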
PS: selenium should only be considered when all other scraping leads have been exhausted.