Get Web Bot To Properly Crawl All Pages Of A Site
I'm trying to crawl all the pages of a website and pull every instance of a certain tag/class. It seems to pull information from the same page over and over, but I'm not sure why, because len(urls) (the stack of URLs being scraped) rises and falls like a bell curve, which makes me think I am at least crawling through the links; I may just be pulling/printing the information improperly.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
url = "http://weedmaps.com"
If I try it with just the base weedmaps.com URL, nothing prints, but if I start from a page that has the type of data I'm looking for, e.g. url = "https://weedmaps.com/dispensaries/shakeandbake", then it pulls the information, but it prints the same information over and over.
urls = [url] # Stack of urls to scrape
visited = [url] # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()
# While stack of urls is greater than 0, keep scraping for links
while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    # Except for visited urls
    except:
        print urls[0]

    # Get and Print Information
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    info = soup.findAll("div", {"class": "story-heading"})
    print info

    # Number of URLs in stack
    print len(urls)

    # Append Incomplete Tags
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])
The problem with your current code is that the URLs you are putting into the queue (urls) point to the same page, but to different anchors, for example:
- https://weedmaps.com/dispensaries/shakeandbake#videos
- https://weedmaps.com/dispensaries/shakeandbake#weedmenu
In other words, the tag['href'] not in visited condition does not filter out URLs that point to the same page but end in different anchors.
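A direct fix, shown here as a minimal sketch that stays with the Python 2 code above, is to strip the fragment with urlparse.urldefrag() before the visited check, so every anchor variant collapses into one page URL:

# Drop-in replacement for the link-collection loop above.
# urlparse.urldefrag() splits "page#anchor" into ("page", "anchor"),
# so the "#videos" and "#weedmenu" variants dedupe to the same entry.
for tag in soup.findAll('a', href=True):
    full_url = urlparse.urljoin(url, tag['href'])
    page_url, fragment = urlparse.urldefrag(full_url)
    if url in page_url and page_url not in visited:
        urls.append(page_url)
        visited.append(page_url)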
From what I can see, you are basically reinventing a web-scraping framework. There is already one that will save you time, keep your web-scraping code organized and clean, and run much faster than your current solution: Scrapy.

You would need a CrawlSpider with rules configured to follow links, for example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        print response.url
        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            print hours.xpath('text()').extract()
Instead of printing, your callback should return or yield Item instances, which you can later save to a file or a database, or process differently in a pipeline.
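As a minimal sketch of that advice (the item class and its field names are hypothetical, not part of the original answer), the callback could yield items like this:

import scrapy

# Hypothetical container for the scraped data; field names are illustrative.
class DispensaryHoursItem(scrapy.Item):
    url = scrapy.Field()
    hours = scrapy.Field()

# Replacement for parse_hours() in the spider above: yield items instead
# of printing, so pipelines and feed exports can process them.
def parse_hours(self, response):
    for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
        yield DispensaryHoursItem(
            url=response.url,
            hours=hours.xpath('text()').extract(),
        )

Inside a Scrapy project you could then run, for example, scrapy crawl weedmaps -o hours.json to write the collected items to a JSON file.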