Web Scraping - Identifying, Executing and Troubleshooting a Request

I'm having some trouble scraping data from the following website:

https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin

When the page loads, it shows the first ~30 real estate listings for the city of São Paulo. If you scroll down, it loads more listings.

Normally I would solve this with selenium, but I want to learn how to do it properly, which I assume means working with the requests themselves.

By using Inspect in Chrome and watching what happens as I scroll down, I can see a request being made that I believe is the one retrieving the new listings.

If I copy it as cURL, I get the following command:

curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
  -X "OPTIONS" ^
  -H "Connection: keep-alive" ^
  -H "Accept: */*" ^
  -H "Access-Control-Request-Method: GET" ^
  -H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
  -H "Origin: https://www.loft.com.br" ^
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
  -H "Sec-Fetch-Mode: cors" ^
  -H "Sec-Fetch-Site: same-site" ^
  -H "Sec-Fetch-Dest: empty" ^
  -H "Referer: https://www.loft.com.br/" ^
  -H "Accept-Language: en-US,en;q=0.9" ^
  --compressed

I wasn't sure of the proper way to convert this into something I could use with the Python requests module, so I used this website - https://curl.trillworks.com/ - to do it.

The result was:

import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters[]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy[]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status[]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)

However, when I try to run it, I get a 204.

So my questions are:

  1. What is the proper/best way to identify the requests made by this website? Is there a better option than what I did?
  2. Once identified, is "copy as cURL" the best way to replicate the command?
  3. What is the best way to replicate the command in Python?
  4. Why am I getting a 204?

1- You did it the proper way! I have been doing it the same way for a long time, and based on my web scraping experience, using your browser's Network tab is by far the best way to get information about the requests a website makes, better than any extension or plugin that I know of. There is also Burp Suite on Kali Linux or on Windows, but the browser's Network tab is always my number one choice!

2- I have been using the same website you mentioned! It makes my life easier and works seamlessly. Of course, you could do it manually, but the website you mentioned makes it easier and faster, and I have been using it for a long time.

3- You could do it manually, it's pretty straightforward, but like I said, the website you mentioned makes it easier and faster.

4- It's probably because you're using requests.options; I would try requests.get instead!
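
For example, here is a minimal sketch of that change. The query parameters are taken from the URL captured in DevTools; whether the API needs any of the original headers is not guaranteed, so treat this as a starting point rather than a finished scraper:

import requests

# Same endpoint as in the question, but requested with GET instead of OPTIONS.
url = 'https://landscape-api.loft.com.br/listing/search'
params = {
    'city': 'São Paulo',
    'facetFilters[]': 'address.city:São Paulo',
    'limit': 18,
    'offset': 28,
    'orderBy[]': 'rankB',
    'originType': 'LISTINGS_LOAD_MORE',
    'q': 'pin',
    'status[]': ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD'],
}

response = requests.get(url, params=params)
print(response.status_code)  # expect 200 with a JSON body instead of an empty 204
print(response.json())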

Your approach to finding the request is correct, but you need to find and analyze the right one.
As for why you get a 204 response code with no results: you are sending an OPTIONS request instead of a GET. In Chrome DevTools you can see two similar requests (see the attached screenshot). One is OPTIONS, and the second is a GET with type xhr.
For the website data you need the second one, but in your code you used requests.options(..). To see a request's response, select it and check the Response or Preview tab.
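
You can see the same thing from Python with a small sanity-check sketch. It uses only a trimmed-down subset of the query parameters from the question, and the exact status codes and headers depend on the server:

import requests

url = 'https://landscape-api.loft.com.br/listing/search'
params = {'city': 'São Paulo', 'q': 'pin', 'limit': 18, 'offset': 28}

# The browser's first request is the CORS preflight: OPTIONS typically answers
# 204 No Content with an empty body, which is what the question's code received.
preflight = requests.options(url, params=params)
print(preflight.status_code, len(preflight.content))

# The xhr GET is the request that actually carries the listing data.
actual = requests.get(url, params=params)
print(actual.status_code, actual.headers.get('Content-Type'))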

One of the best HTTP libraries in Python is requests.

Below is the full code to get all the search results:

import requests

headers = {
    'x-user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/88.0.4324.146 Safari/537.36',
    'utm_created_at': '',
    'Accept': 'application/json, text/plain, */*',
}

with requests.Session() as s:
    s.headers = headers

    listings = list()
    limit = 18  # the API returns 18 listings per request, matching the site's "load more" behaviour
    offset = 0
    # page through the results until the API stops returning new listings
    while True:
        # query parameters as captured in DevTools; keys ending in [] mirror the array-style parameters in the URL
        params = {
            "city": "São Paulo",
            "facetFilters[]": "address.city:São Paulo",
            "limit": limit,
            "limitedColumns": "true",
            # "loftUserId": "a2531ad4-cc3f-49b0-8828-e78fb489def8",
            "offset": offset,
            "orderBy[]": "rankA",
            "orderByStatus": "'FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON' , 'SOLD'",
            "originType": "LISTINGS_LOAD_MORE",
            "q": "pin",
            "status[]": ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]
        }
        r = s.get('https://landscape-api.loft.com.br/listing/search', params=params)
        r.raise_for_status()

        data = r.json()
        listings.extend(data["listings"])

        offset += limit
        total = data["pagination"]["total"]
        if len(data["listings"]) == 0 or len(listings) == total:
            break

print(len(listings))
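
If you want to keep the results, a simple way is to dump the raw JSON objects to disk. The filename below is just an example, and no particular listing fields are assumed:

import json

# Persist the scraped listings for later processing; adjust the filename as needed.
with open('loft_listings.json', 'w', encoding='utf-8') as f:
    json.dump(listings, f, ensure_ascii=False, indent=2)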