Beautiful Soup find returns None from rightmove

I'm trying to parse the HTML from this page with Beautiful Soup: https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1

I have:

import requests
from bs4 import BeautifulSoup as soup

req = requests.get(url)
# page_soup = soup(req.content, 'html.parser')
page_soup = soup(req.content, 'lxml')
no_results = page_soup.find('div', {'class': 'section sort-bar-results'})
containers = page_soup.findAll('div', {'class': 'propertyCard'})
no_results, len(containers)

This returns (None, 0).

I've looked at similar questions about Beautiful Soup returning None, but unfortunately none of them helped me.

The relevant part of the HTML is:

Is there something obvious that I'm missing?

The page does have dynamic content. You should use selenium with a webdriver to load all the content before scraping.
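The effect can be reproduced with a minimal, hypothetical snippet (the HTML below is made up and is not Rightmove's actual markup): when the data is shipped inside a `<script>` tag and only rendered into divs by JavaScript, `find()` on those divs returns None even though the data is present in the response:

```python
from bs4 import BeautifulSoup

# Hypothetical response body: the property data is JSON inside a
# <script> tag, not rendered into <div class="propertyCard"> elements.
html = """
<html><body>
  <script>window.__PRELOADED_STATE__ = {"results": {"resultCount": 35}}</script>
</body></html>
"""

page_soup = BeautifulSoup(html, 'html.parser')

# The div the scraper looks for does not exist in the raw HTML...
print(page_soup.find('div', {'class': 'propertyCard'}))  # None

# ...but the data itself is still there, inside the <script> tag.
print('resultCount' in page_soup.find('script').string)  # True
```

A headless browser executes the JavaScript first, so the divs exist by the time you parse `driver.page_source`.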

You can try downloading the ChromeDriver executable here. If you paste it into the same folder as your script, you can run:

import os
from selenium import webdriver
from bs4 import BeautifulSoup

# configure a headless Chrome driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")  # IF NOT IN SAME FOLDER CHANGE THIS PATH
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = 'https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1'

driver.get(url)
page_soup = BeautifulSoup(driver.page_source, "html.parser")
no_results = page_soup.find('div', {'class': 'section sort-bar-results'})
containers = page_soup.findAll('div', {'class': 'propertyCard'})
print(no_results.text)
print(len(containers), "containers")
driver.quit()

You said that answer didn't help, but I tried it here and the result was:

35 sold properties
25 containers
import requests
import re
import json


def main(url):
    r = requests.get(url)
    match = json.loads(
        re.search(r'__PRELOADED_STATE__.+?({.+?})<', r.text).group(1))
    # print(match.keys())  # inspect the full JSON dict keys
    print(match['results']['resultCount'])


main("https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1")

输出:

35

You don't need to use selenium, as it will slow down your task. The desired element is present within the page source code, encoded within a dynamic <script> tag.
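The same regex + json technique can be sketched against a made-up fragment of page source (the snippet below is hypothetical; only the `__PRELOADED_STATE__` pattern mirrors the real page):

```python
import re
import json

# Hypothetical page source: the real page embeds its data the same way,
# as JSON assigned to window.__PRELOADED_STATE__ inside a <script> tag.
page_source = (
    '<script>window.__PRELOADED_STATE__ = '
    '{"results": {"resultCount": 35}}</script>'
)

# Capture the JSON object between __PRELOADED_STATE__ and the closing tag,
# then parse it instead of scraping rendered HTML.
match = json.loads(
    re.search(r'__PRELOADED_STATE__.+?({.+?})<', page_source).group(1))
print(match['results']['resultCount'])  # 35
```

The non-greedy `({.+?})<` stops at the first brace-balanced candidate followed by `<`, which here is the end of the `<script>` tag.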