Cannot Write Web Crawler in Python
I'm having trouble writing a basic web crawler. I want to write the raw HTML of roughly 500 pages to files. The problem is that my search is either too broad or too narrow: it either goes too deep and never gets past the first loop, or it doesn't go deep enough and returns nothing.
I've tried the limit= parameter of find_all(), but with no luck.
Any suggestions would be appreciated.
from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()
            with open(str(page.replace("/", "_")) + ".txt", "a+") as f:
                f.write(s)
                f.close()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True, limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')
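As far as I can tell, limit= in find_all() only caps how many matching tags come back from a single page; it says nothing about how deep the crawl goes. A minimal sketch with a made-up HTML snippet to illustrate:

from bs4 import BeautifulSoup

# hypothetical page containing six links; only the per-call match count is limited
html = '<a href="http://example.com"></a>' * 6
soup = BeautifulSoup(html)
links = soup.find_all('a', href=True, limit=5)
print(len(links))  # 5: limit caps matches returned per call, not crawl depth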
I modified your function so that it doesn't write to a file and just prints the URLs, and this is what I got:
http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
So it looks like yours works, but there's a redirect loop. Maybe try rewriting it as a recursive function so that you do a depth-first search instead of the breadth-first search that I'm pretty sure is happening now.
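Whichever way you go, the part that actually breaks the loop is remembering which URLs you've already fetched. Just as a rough sketch for comparison (same urllib2/BeautifulSoup setup as above; crawler_with_set and max_pages are names made up for this sketch, not the rewrite below), the original loop with a crawled set bolted on might look like:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler_with_set(seed_url, max_pages=500):
    to_crawl = [seed_url]
    crawled = set()                       # URLs we have already fetched
    while to_crawl and len(crawled) < max_pages:
        page = to_crawl.pop()
        if not page.startswith("http") or page in crawled:
            continue                      # skip non-http links and anything already seen
        crawled.add(page)
        s = urlopen(page).read()
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in crawled:
                to_crawl.append(a)
    return crawled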
Edit: here's a recursive version:
def recursive_crawler(url, crawled):
    if len(crawled) >= 500:   # stop once we've collected 500 URLs
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()
    # write to file here, if desired
    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)
Pass it an empty set for crawled:
c = set()
recursive_crawler('http://www.nytimes.com', c)
Output (I interrupted it after a few seconds):
http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html
Thanks to whoever suggested using an already_crawled set.
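If you do want the raw HTML on disk like the original question asked, the "# write to file here, if desired" line is where it goes. A sketch that reuses the question's filename scheme (replacing "/" with "_"); save_page is just a name invented for this sketch:

from urllib2 import urlopen

def save_page(url):
    # fetch one page and write its raw HTML to a file named after the URL
    s = urlopen(url).read()
    with open(url.replace("/", "_") + ".txt", "w") as f:
        f.write(s)
    return s

Calling save_page(url) in place of the urlopen(url).read() pair inside recursive_crawler would keep the 500-page cap working exactly as before.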