Beautiful Soup find returns None from Rightmove
I'm trying to parse the HTML from https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1 with Beautiful Soup.

I have:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1'
req = requests.get(url)
# page_soup = soup(req.content, 'html.parser')
page_soup = soup(req.content, 'lxml')
no_results = page_soup.find('div', {'class': 'section sort-bar-results'})
containers = page_soup.find_all('div', {'class': 'propertyCard'})
no_results, len(containers)
This returns (None, 0).

I looked at Beautiful soup returns None, but unfortunately none of the answers there helped me.

Is there something obvious I'm missing?
The page does have dynamic content, so you should use Selenium with a webdriver to load everything before scraping.

You can download the ChromeDriver executable here. If you put it in the same folder as your script, you can run:
import os
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# configure a headless Chrome driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")  # IF NOT IN SAME FOLDER CHANGE THIS PATH
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = 'https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1'
driver.get(url)

page_soup = soup(driver.page_source, "html.parser")
no_results = page_soup.find('div', {'class': 'section sort-bar-results'})
containers = page_soup.find_all('div', {'class': 'propertyCard'})
print(no_results.text)
print(len(containers), "containers")
The answers in the question you linked didn't help, but I tried this and got:
35 sold properties
25 containers
import requests
import re
import json

def main(url):
    r = requests.get(url)
    match = json.loads(
        re.search(r'__PRELOADED_STATE__.+?({.+?})<', r.text).group(1))
    # print(match.keys())  # full JSON dict keys
    print(match['results']['resultCount'])

main("https://www.rightmove.co.uk/house-prices/br5/broadcroft-road.html?page=1")
Output:
35
You don't need to use Selenium, as it will slow down your task. The desired element is present within the page source code: it's encoded as JSON inside a <script> tag that is rendered dynamically.
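To see why the regex above works without a browser, here is a minimal, self-contained sketch using only the standard library. The HTML snippet below is hypothetical (a simplified stand-in for what Rightmove actually serves), but it mimics the structure the answer relies on: a JSON blob assigned to window.__PRELOADED_STATE__ inside a script tag.

```python
import json
import re

# Hypothetical page mimicking how the data is embedded: the JSON lives in a
# <script> tag, assigned to window.__PRELOADED_STATE__.
html = '''<html><body>
<script>window.__PRELOADED_STATE__ = {"results": {"resultCount": 35}}</script>
</body></html>'''

# Same regex as the answer above: lazily skip past the marker, then capture
# the JSON object up to the "<" that opens the closing </script> tag.
m = re.search(r'__PRELOADED_STATE__.+?({.+?})<', html)
state = json.loads(m.group(1))
print(state['results']['resultCount'])  # -> 35
```

Because the JSON ships with the initial response, a plain requests.get is enough; no JavaScript ever needs to run.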