Get Web Bot To Properly Crawl All Pages Of A Site

I'm trying to crawl all pages of a website and extract every instance of a certain tag/class.

It seems to be pulling information from the same page over and over, but I'm not sure why, because len(urls) (the stack of URLs being scraped) rises and then falls like a bell curve, which makes me think I am at least crawling through the links; I may just be pulling/printing the information incorrectly.

import urllib
import urlparse
import re
from bs4 import BeautifulSoup

url = "http://weedmaps.com"

If I try using just the base weedmaps.com URL, nothing prints, but if I start from a page that has the kind of data I'm looking for, e.g. url = "https://weedmaps.com/dispensaries/shakeandbake", then it pulls the information, but it prints the same information over and over.

urls = [url] # Stack of urls to scrape
visited = [url] # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()

# While stack of urls is greater than 0, keep scraping for links
while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()

    # Except for visited urls
    except:
        print urls[0]

    # Get and Print Information
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    info = soup.findAll("div", {"class":"story-heading"})

    print info

    # Number of URLs in stack
    print len(urls)

    # Append Incomplete Tags
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])

The problem with your current code is that the URLs you put into the queue (urls) point to the same page but to different anchors, i.e. links that differ only in the #fragment at the end of an otherwise identical URL.

In other words, the tag['href'] not in visited condition does not filter out different URLs that point to the same page but to different anchors.
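To make that check effective, one option (a minimal sketch that sticks with the question's Python 2 / urlparse style) is to strip the fragment with urlparse.urldefrag before comparing against visited; the normalize helper below is just an illustrative name, not something from the original code:

import urlparse

def normalize(link):
    # urldefrag splits a URL into (url_without_fragment, fragment);
    # keeping only the first part makes page#videos and page#photos
    # compare equal, so the visited check can actually filter them.
    clean, _fragment = urlparse.urldefrag(link)
    return clean

# Inside the crawl loop, compare the normalized URL instead of the raw href:
# href = normalize(urlparse.urljoin(url, tag['href']))
# if url in href and href not in visited:
#     urls.append(href)
#     visited.append(href)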

From what I can see, you are reinventing a web-scraping framework. There is already one that will save you time, keep your web-scraping code organized and clean, and be much faster than your current solution: Scrapy.

You would need a CrawlSpider and to configure the rules so it follows the links, for example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        print response.url

        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            print hours.xpath('text()').extract()

Instead of printing, your callback should return or yield Item instances, which you can later save to a file or a database, or process differently in an item pipeline.
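As a rough sketch of that idea, reusing the spider above (the HoursItem class and its fields are made up for illustration; they are not part of Scrapy or of the original answer):

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HoursItem(Item):
    # Hypothetical item holding one page's opening-hours text.
    url = Field()
    hours = Field()

class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        # Yield structured items instead of printing; they can then be
        # exported, e.g. with "scrapy crawl weedmaps -o hours.json",
        # or handled by an item pipeline.
        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            yield HoursItem(url=response.url,
                            hours=hours.xpath('text()').extract())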