Web crawler class not working
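I recently started building a simple web crawler. My initial code, which just iterated twice, worked fine, but when I tried to turn it into a class with error/exception handling, it no longer compiles.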
import re, urllib
class WebCrawler:
    """A Simple Web Crawler That Is Readily Extensible"""
    def __init__():
        size = 1
    def containsAny(seq, aset):
        for c in seq:
            if c in aset: return True
        return False
    def crawlUrls(url, depth):
        textfile = file('UrlMap.txt', 'wt')
        urlList = [url]
        size = 1
        for i in range(depth):
            for ee in range(size):
                if containsAny(urlList[ee], "http://"):
                    try:
                        webpage = urllib.urlopen(urlList[ee]).read()
                        break
                    except:
                        print "Following URL failed!"
                        print urlList[ee]
                    for ee in re.findall('''href=["'](.[^"']+)["']''',webpage, re.I):
                        print ee
                        urlList.append(ee)
                        size+=1
                        textfile.write(ee+'\n')
myCrawler = WebCrawler
myCrawler.crawlUrls("http://www.wordsmakeworlds.com/", 2)
Here is the error it produces.
Traceback (most recent call last):
  File "C:/Users/Noah Huber-Feely/Desktop/Python/WebCrawlerClass", line 33, in <module>
    myCrawler.crawlUrls("http://www.wordsmakeworlds.com/", 2)
TypeError: unbound method crawlUrls() must be called with WebCrawler instance as first argument (got str instance instead)
You have two problems. The first is this line:
myCrawler = WebCrawler
You are not creating an instance of WebCrawler; you are only binding the name myCrawler to WebCrawler (basically, creating an alias for the class). You should do this instead:
myCrawler = WebCrawler()
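To make the difference concrete, here is a minimal sketch. The empty class body and the names alias and crawler are only for illustration and are not part of your code:

```python
class WebCrawler:
    """Stand-in for the crawler class; the body does not matter here."""
    pass

alias = WebCrawler       # no parentheses: a second name for the class itself
crawler = WebCrawler()   # parentheses: calls the class and builds an instance

print(alias is WebCrawler)              # the alias is the very same class object
print(isinstance(crawler, WebCrawler))  # crawler is an instance of the class
print(isinstance(alias, WebCrawler))    # the class is not an instance of itself
```

Calling a method through alias is the same as calling it through WebCrawler directly, which is why the error complains that no instance was supplied.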
The second is this line:
def crawlUrls(url, depth):
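Python instance methods take the receiver as the first argument of the method. It is conventionally called self, but technically you can name it whatever you like. So you should change the method signature to: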
def crawlUrls(self, url, depth):
(You will also need to do this for the other methods you have defined.)
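As a rough sketch of what that looks like across the whole class, assuming you keep the same method names: every method takes self first, and state such as size is stored on self rather than in a local variable. The extractLinks method is a hypothetical helper that is not in your code; it applies your href regex to a string so the example runs without network access.

```python
import re

class WebCrawler:
    """A Simple Web Crawler That Is Readily Extensible"""

    def __init__(self):
        # Store state on the instance; a bare "size = 1" would be a
        # local variable that disappears when __init__ returns.
        self.size = 1

    def containsAny(self, seq, aset):
        # Same logic as the original helper, with self added as the receiver.
        for c in seq:
            if c in aset:
                return True
        return False

    def extractLinks(self, html):
        # Hypothetical helper (an assumption, not from the question):
        # runs the question's href pattern over an HTML string.
        return re.findall('''href=["'](.[^"']+)["']''', html, re.I)

myCrawler = WebCrawler()
links = myCrawler.extractLinks('<a href="http://example.com/a">a</a>')
print(links)
print(myCrawler.containsAny("abc", "xyzc"))
```

Note that once containsAny is a method, code inside the class has to call it as self.containsAny(...), not as a bare containsAny(...).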