Python simple web crawler error (infinite loop crawling)
I wrote a simple crawler in Python. It seems to work fine and finds new links, but it keeps finding the same links over and over, and it never downloads the new pages it finds. It appears to crawl forever, even after reaching the configured crawl depth limit. I don't get any errors; it just runs forever. Here are the code and a sample run. I'm using Python 2.7 on Windows 7 64-bit.
import sys
import time
from bs4 import *
import urllib2
import re
from urlparse import urljoin

def crawl(url):
    url = url.strip()
    page_file_name = str(hash(url))
    page_file_name = page_file_name + ".html"
    fh_page = open(page_file_name, "w")
    fh_urls = open("urls.txt", "a")
    fh_urls.write(url + "\n")
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    html_text = str(soup)
    fh_page.write(url + "\n")
    fh_page.write(page_file_name + "\n")
    fh_page.write(html_text)
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    rs = []
    for link in links:
        try:
            #r = urllib2.urlparse.urljoin(url, link)
            r = urllib2.urlopen(link)
            r_str = str(r.geturl())
            fh_urls.write(r_str + "\n")
            #a = urllib2.urlopen(r)
            if r.headers['content-type'] == "html" and r.getcode() == 200:
                rs.append(r)
                print "Extracted link:"
                print link
                print "Extracted link final URL:"
                print r
        except urllib2.HTTPError as e:
            print "There is an error crawling links in this page:"
            print "Error Code:"
            print e.code
    return rs
    fh_page.close()
    fh_urls.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: python crawl.py <seed_url> <crawling_depth>"
        print "e.g: python crawl.py https://www.yahoo.com/ 5"
        exit()
    url = sys.argv[1]
    depth = sys.argv[2]
    print "Entered URL:"
    print url
    html_page = urllib2.urlopen(url)
    print "Final URL:"
    print html_page.geturl()
    print "*******************"
    url_list = [url, ]
    current_depth = 0
    while current_depth < depth:
        for link in url_list:
            new_links = crawl(link)
            for new_link in new_links:
                if new_link not in url_list:
                    url_list.append(new_link)
            time.sleep(5)
        current_depth += 1
        print current_depth
Here is what I get when I run it:
C:\Users\Hussam-Den\Desktop>python test.py https://www.yahoo.com/ 4
Entered URL:
https://www.yahoo.com/
Final URL:
https://www.yahoo.com/
*******************
1
And here is the output file that stores the crawled URLs:
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
Any idea what is going wrong?
- There is a bug here: depth = sys.argv[2]. The values in sys.argv are str, not int, so you should write depth = int(sys.argv[2]).
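For illustration, the argument parsing in the main block would then read (a one-line change, everything else unchanged):

url = sys.argv[1]
depth = int(sys.argv[2])  # sys.argv entries are always strings; convert before comparing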
- Because of point 1, the condition while current_depth < depth: always evaluates to True: in Python 2, comparing an int with a str does not raise a TypeError, and the int always sorts as less than the str.
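You can check this Python 2 quirk in the interpreter; the comparison silently succeeds instead of raising a TypeError as it would in Python 3:

>>> 100 < "4"      # int vs str: CPython 2 orders numbers before strings
True
>>> 100 < int("4")
False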
Try to fix it by converting argv[2] to int. I suspect there are other mistakes as well.
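Putting the pieces together, here is a minimal sketch of a corrected main loop. It is illustrative only and makes one extra assumption: that crawl() is changed to return the final URL strings (e.g. r.geturl()) instead of urllib2 response objects, because a freshly returned response object is never equal to anything already in url_list, so the membership test can never deduplicate anything:

url = sys.argv[1]
depth = int(sys.argv[2])            # convert once, up front

visited = set()                     # URLs already crawled, as strings
frontier = [url]                    # URLs to crawl at the current depth
current_depth = 0
while current_depth < depth and frontier:
    next_frontier = []
    for link in frontier:
        if link in visited:
            continue                # already crawled at a shallower depth
        visited.add(link)
        for new_link in crawl(link):        # assumed to return URL strings now
            if new_link not in visited:
                next_frontier.append(new_link)
        time.sleep(5)
    frontier = next_frontier
    current_depth += 1
    print current_depth

Note also that the check r.headers['content-type'] == "html" in crawl() probably never matches, because real Content-Type headers look like text/html; charset=utf-8; a substring test such as "html" in r.headers['content-type'] would be closer to what was intended.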