Missing items when scraping javascript rendered page using scrapy and splash

I am trying to scrape the following website for basic real-estate listing information:

https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y

Parts of the site's content are loaded dynamically from a back-end API as the page is scrolled down with JavaScript. To handle this I tried using Scrapy together with Splash to render the JavaScript. The problem I am running into is that not all listings are returned, only the first 8. My guess is that the page is never scrolled down, so it never gets populated and the divs I need are never rendered. I then tried adding some Lua code (which I have no experience with) to scroll the page down, hoping it would get populated, but it did not work. Below is my spider:

import scrapy
from scrapy.shell import inspect_response
import pandas as pd
import functools
import time
import requests
from lxml.html import fromstring
import math
from scrapy_splash import SplashRequest
import scrapy_splash



class pfspider(scrapy.Spider):
    name = 'property_finder_spider'

    start_urls = ["https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y"]



    script1 = """function main(splash)
        local num_scrolls = 10
        local scroll_delay = 1.0

        local scroll_to = splash:jsfunc("window.scrollTo")
        local get_body_height = splash:jsfunc(
            "function() {return document.body.scrollHeight;}"
        )
        assert(splash:go(splash.args.url))
        splash:wait(splash.args.wait)

        for _ = 1, num_scrolls do
            scroll_to(0, get_body_height())
            splash:wait(scroll_delay)
        end        
        return splash:html()
    end"""


    def start_requests(self):
        for urll in self.start_urls:
            # yield scrapy_splash.SplashRequest(url=urll, callback=self.parse, endpoint='execute', args={'wait': 2, 'lua_source': self.script1})
            yield scrapy_splash.SplashRequest(url=urll, endpoint='render.html', callback=self.parse)



    def parse(self, response):
        inspect_response(response, self)

        containers = response.xpath('//div[@class="column--primary"]/div[@class="card-list__item"]')

        Listing_names_pf = containers[0].xpath('//h2[@class="card__title card__title-link"]/text()').extract()

        Currency_pf = ['AED'] * len(Listing_names_pf)

        Prices_pf = containers[0].xpath('//span[@class="card__price-value"]/text()').extract()

        type_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--property-type"]/text()').extract()

        Bedrooms_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--bedrooms"]/text()').extract()

        Bathrooms_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--bathrooms"]/text()').extract()

        SQF_pf = containers[0].xpath('//p[@class="card__property-amenity card__property-amenity--area"]/text()').extract()

        Location_pf = containers[0].xpath('//span[@class="card__location-text"]/text()').extract()

        Links_pf =  containers[0].xpath('//div[@class="card-list__item"]/a/@href').extract()

        Links_pf_full = []

        for link in Links_pf:
            Links_pf_full.append('https://www.propertyfinder.ae/'+link)


Another thing I noticed is that when the page is rendered through Splash, the HTML output file contains a script called Tealium that does hold the listing data for every item in the list, but that data never makes it into the divs on the page.
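
If that script embeds the listing data as a JSON object, one option is to pull it straight out of the Splash-rendered HTML instead of waiting for the card divs to appear. The following is only a sketch: the utag_data variable name (Tealium's usual data-layer object) and the assumption that it is assigned as a single JSON literal would both need to be checked against the actual script contents:

import json
import re

def extract_tealium_data(html):
    # hypothetical: assumes the Tealium script contains something like "var utag_data = {...};"
    match = re.search(r'var\s+utag_data\s*=\s*(\{.*?\});', html, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    return None

If the embedded object is not strict JSON (single quotes, unquoted keys, trailing commas), json.loads will fail and a tolerant parser such as chompjs would be needed instead.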

Any help or suggestions would be greatly appreciated.

I'm not familiar with Scrapy, but this can be done with Requests alone. Just look through the F12 -> XHR tab to find the url below.

To make things clearer, I broke the parameters down into a list of tuples, which requests then recombines with the base url. The include parameter can be "simplified" to contain only the data you want to retrieve, but by default it includes everything. You can iterate over page[number] (see the pagination sketch after the sample output below), but be aware that you may get blocked if you send too many requests per second.

import requests as rq

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"}
url = "https://www.propertyfinder.ae/en/api/search?"
params = [
    ("filter[category_id]", "2"),
    ("filter[furnished]","0"),
    ("filter[locations_ids][]","50"),
    ("filter[price_type]","y"),
    ("include","properties,properties.property_type,properties.property_images,properties.location_tree,properties.agent,properties.agent.languages,properties.broker,smart_ads,smart_ads.agent,smart_ads.broker,smart_ads.property_type,smart_ads.property_images,smart_ads.location_tree,direct_from_developer,direct_from_developer.property_type,direct_from_developer.property_images,direct_from_developer.location_tree,direct_from_developer.agent,direct_from_developer.broker,cts,cts.agent,cts.broker,cts.property_type,cts.property_images,cts.location_tree,similar_properties,similar_properties.agent,similar_properties.broker,similar_properties.property_type,similar_properties.property_images,similar_properties.location_tree,agent_smart_ads,agent_smart_ads.broker,agent_smart_ads.languages,agent_properties_smart_ads,agent_properties_smart_ads.agent,agent_properties_smart_ads.broker,agent_properties_smart_ads.location_tree,agent_properties_smart_ads.property_type,agent_properties_smart_ads.property_images"),
    ("page[limit]","25"),
    ("page[number]","4"),
    ("sort","nd")
]

resp = rq.get(url, params=params, headers=headers).json()

Then you search resp for the data you are interested in:

resultat = []
for el in resp["included"]:
    if el["type"] == "property":
        data = {
            "name": el["attributes"]["name"],
            "default_price": el["attributes"]["default_price"],
            "bathroom_value": el["attributes"]["bathroom_value"],
            "bedroom_value": el["attributes"]["bedroom_value"],
            "coordinates" : el["attributes"]["coordinates"]}

        resultat.append(data)

resultat then contains:

[{'name': '1Bed Apartment | Available | Large Terrace',
  'default_price': 92000,
  'bathroom_value': 2,
  'bedroom_value': 1,
  'coordinates': {'lat': 25.08333, 'lon': 55.144753}},
 {'name': 'Furnished  |Full sea view | All bills included',
  'default_price': 179000,
  'bathroom_value': 3,
  'bedroom_value': 2,
  'coordinates': {'lat': 25.083121, 'lon': 55.141064}},
   ........
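
As mentioned above, getting every listing just means iterating page[number]. Reusing url, params and headers from the snippet above, a minimal pagination sketch with a small delay between requests could look like this (the 25-page cap and the 1-second sleep are arbitrary choices, not values dictated by the site):

import time

all_properties = []
for page in range(1, 26):  # arbitrary upper bound; stop earlier once a page comes back empty
    page_params = [p for p in params if p[0] != "page[number]"] + [("page[number]", str(page))]
    resp = rq.get(url, params=page_params, headers=headers).json()
    page_items = [el for el in resp.get("included", []) if el["type"] == "property"]
    if not page_items:
        break  # no more listings on this page
    all_properties.extend(page_items)
    time.sleep(1)  # be polite: keep the request rate low to avoid getting blocked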

PS: selenium should only be considered once all other scraping leads have been exhausted.