Scraping a website that requires you to scroll down

I am trying to scrape this website: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

However, it requires me to scroll down in order to collect more of the data, and I can't figure out how to scroll down using Beautiful Soup or Python. Does anyone here know how to do that?

The code is a bit messy, but here it is.

import scrapy
from scrapy.selector import Selector
from testtest.items import TesttestItem
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
import re
import time


class MLStripper(HTMLParser):
    # Standard HTMLParser-based tag stripper: collects the text nodes it is
    # fed so strip_tags() can join them back together without any markup.
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


class MySpider(scrapy.Spider):
    name = "A1Locker"

    allowed_domains = ['https://www.a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all']

    def strip_tags(self, html):
        # Helper (currently unused) that strips markup from an HTML string.
        s = MLStripper()
        s.feed(html)
        return s.get_data()

    def parse(self, response):
        # Load the "Small" category page in a real browser so the
        # JavaScript-rendered listings show up in the page source.
        url = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Small'
        driver = webdriver.Firefox()
        driver.get(url)
        # Note: nothing scrolls the page here, so only the units rendered
        # on the first load end up in page_source.
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        # Same again for the "Medium" category page, in a second browser.
        url2 = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Medium'
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        html2 = driver2.page_source
        soup2 = BeautifulSoup(html2, 'html.parser')

        items = []
        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10",
                         "10 x 20", "10 x 25", "10 x 30"]

        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})
        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"}

        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)
            if "Outside" in typesTagz[n].get_text():
                size.append(re.findall(r'\d+', sizeTagz[n].get_text()))
                size.append(re.findall(r'\d+', sizeTagz2[n].get_text()))
                print "logic hit"

        for i in range(len(size)):
            yield {'size': size[i]}

        driver.close()
        driver2.close()

The expected output of the code is for it to display the data collected from this webpage: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

Doing that requires being able to scroll down to see the rest of the data, at least as far as I can tell.

Thanks, DM123

There is a webdriver function that provides this. BeautifulSoup does nothing except parse the site.

Take a look at this: http://webdriver.io/api/utility/scroll.html
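
For what it's worth, the scroll helper at that link belongs to the JavaScript WebdriverIO bindings. Since your spider already drives Firefox through Python's Selenium, the usual equivalent there is to run a small snippet of JavaScript via execute_script. A minimal sketch, assuming the "Small" category URL from your own code and an arbitrary two-second wait:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
           '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Small')

# Scroll the window to the bottom of the document, then give the page's
# JavaScript a moment to render whatever loads in down there.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)

html = driver.page_source  # now includes the content revealed by scrolling
driver.quit()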

The website you are trying to scrape loads its content dynamically with JavaScript. Unfortunately, many web scrapers, Beautiful Soup included, cannot execute JavaScript on their own. There are plenty of options, though, many of them in the form of headless browsers. A classic one is PhantomJS, but it may be worth taking a look at this great list of options on GitHub, some of which pair well with Beautiful Soup, such as Selenium.

With Selenium in mind, the answer to this Stack Overflow question may also help.
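
To make the pairing concrete: the usual pattern is to let Selenium scroll inside the browser until no more content appears, and only then hand the fully rendered page source to Beautiful Soup. A rough sketch under those assumptions (the category=all URL and the sss-unit-size class are taken from the question; the two-second pause is an arbitrary guess at how long the page needs to load new units):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

url = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
       '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all')

driver = webdriver.Firefox()
driver.get(url)

# Keep scrolling to the bottom until the page height stops growing,
# i.e. until no more listings are being loaded in.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page's JavaScript time to render new units
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Only now parse the rendered page with Beautiful Soup.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for tag in soup.findAll('span', {'class': 'sss-unit-size'}):
    print tag.get_text()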