Cannot Write Web Crawler in Python
I'm having trouble writing a basic web crawler. I want to write the raw HTML of roughly 500 pages to files. The problem is that my search is either too broad or too narrow: it either goes too deep and never gets past the first loop, or it doesn't go deep enough and returns nothing.
I've tried the limit= parameter of find_all(), but with no luck.
Any suggestions would be appreciated.
from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()
            with open(str(page.replace("/", "_")) + ".txt", "a+") as f:
                f.write(s)
                f.close()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True, limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')
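As far as I can tell, limit= in find_all() only caps how many matching tags come back from a single page; it says nothing about how deep the crawl goes. A minimal sketch with a made-up HTML snippet to illustrate:

from bs4 import BeautifulSoup

# hypothetical page containing six links; only the per-call match count is limited
html = '<a href="http://example.com"></a>' * 6
soup = BeautifulSoup(html)
links = soup.find_all('a', href=True, limit=5)
print(len(links))  # 5: limit caps matches returned per call, not crawl depth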
I modified your function so that it doesn't write to a file and just prints the URLs, and this is what I got:
http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
So it looks like yours works, but there's a redirect loop. Maybe try rewriting it as a recursive function so that you do a depth-first search instead of the breadth-first search that I'm pretty sure is happening now.
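Whichever way you go, the part that actually breaks the loop is remembering which URLs you've already fetched. Just as a rough sketch for comparison (same urllib2/BeautifulSoup setup as above; crawler_with_set and max_pages are names made up for this sketch, not the rewrite below), the original loop with a crawled set bolted on might look like:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler_with_set(seed_url, max_pages=500):
    to_crawl = [seed_url]
    crawled = set()                       # URLs we have already fetched
    while to_crawl and len(crawled) < max_pages:
        page = to_crawl.pop()
        if not page.startswith("http") or page in crawled:
            continue                      # skip non-http links and anything already seen
        crawled.add(page)
        s = urlopen(page).read()
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in crawled:
                to_crawl.append(a)
    return crawled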
Edit: here's a recursive version:
def recursive_crawler(url, crawled):
    if len(crawled) >= 500:   # stop once we've collected 500 URLs
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()
    # write to file here, if desired
    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)
Pass it an empty set for crawled:
c = set()
recursive_crawler('http://www.nytimes.com', c)
Output (I interrupted it after a few seconds):
http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html
Thanks to whoever suggested using an already_crawled set.
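If you do want the raw HTML on disk like the original question asked, the "# write to file here, if desired" line is where it goes. A sketch that reuses the question's filename scheme (replacing "/" with "_"); save_page is just a name invented for this sketch:

from urllib2 import urlopen

def save_page(url):
    # fetch one page and write its raw HTML to a file named after the URL
    s = urlopen(url).read()
    with open(url.replace("/", "_") + ".txt", "w") as f:
        f.write(s)
    return s

Calling save_page(url) in place of the urlopen(url).read() pair inside recursive_crawler would keep the 500-page cap working exactly as before.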