Speeding up the processing of file downloads from the web
I'm writing a program that has to download a bunch of files from the web before it can run, so I created a function called init_program that downloads everything and "initializes" the program. Here is how it works: it runs through a pair of dicts containing URLs that point to gist files on GitHub, pulls out each URL, and downloads it with urllib2. I can't include all the files here, but you can try it by cloning the repository here. This is the function that creates the files from the gists:
def init_program():
    """ Initialize the program and allow all the files to be downloaded
    This will take awhile to process, but I'm working on the processing
    speed """
    downloaded_wordlists = []  # Used to count the amount of items downloaded
    downloaded_rainbow_tables = []
    print("\n")
    banner("Initializing program and downloading files, this may take awhile..")
    print("\n")
    # INIT_FILE is a file that will contain "false" if the program is not initialized
    # And "true" if the program is initialized
    with open(INIT_FILE) as data:
        if data.read() == "false":
            for item in GIST_DICT_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(len(downloaded_wordlists) + 1,
                                                                                  len(GIST_DICT_LINKS.keys())))
                sys.stdout.flush()
                new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+")
                # Download the wordlists and save them into a file
                wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item])
                new_wordlist.write(wordlist_data.read())
                downloaded_wordlists.append(item + ".txt")
                new_wordlist.close()
            print("\n")
            banner("Done with wordlists, moving to rainbow tables..")
            print("\n")
            for table in GIST_RAINBOW_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(len(downloaded_rainbow_tables) + 1,
                                                                                    len(GIST_RAINBOW_LINKS.keys())))
                sys.stdout.flush()
                new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table), "a+")
                # Download the rainbow tables and save them into a file
                rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table])
                new_rainbowtable.write(rainbow_data.read())
                downloaded_rainbow_tables.append(table + ".rtc")
                new_rainbowtable.close()
            # Mark the program as initialized so this never runs again
            with open(INIT_FILE, "w") as init_flag:
                init_flag.write("true")
        else:
            pass
    return downloaded_wordlists, downloaded_rainbow_tables
This works, yes, but it is painfully slow because of the size of the files; each one has at least 100,000 lines. How can I speed this function up and make it faster and friendlier to use?
A few weeks ago I was in a similar situation where I needed to download many huge files, and I found that none of the simple pure-Python solutions were good enough at optimizing the downloads. So I found Axel, a light command-line download accelerator for Linux and Unix.
What is Axel?
Axel tries to accelerate the downloading process by using multiple
connections for one file, similar to DownThemAll and other famous
programs. It can also use multiple mirrors for one download.
Using Axel, you will get files faster from Internet. So, Axel can
speed up a download up to 60% (approximately, according to some
tests).
Usage: axel [options] url1 [url2] [url...]
--max-speed=x -s x Specify maximum speed (bytes per second)
--num-connections=x -n x Specify maximum number of connections
--output=f -o f Specify local output file
--search[=x] -S [x] Search for mirrors and download from x servers
--header=x -H x Add header string
--user-agent=x -U x Set user agent
--no-proxy -N Just don't use any proxy server
--quiet -q Leave stdout alone
--verbose -v More status information
--alternate -a Alternate progress indicator
--help -h This information
--version -V Version information
Since axel is written in C and there's no C extension for Python, I used the subprocess module to execute it externally, and it works perfectly for me.
You can do something like this:
cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o',
"{0}".format(filename), url]
process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
You can also track the progress of each download by parsing the output that axel writes to stdout:
while True:
line = process.stdout.readline()
progress = YOUR_GREAT_REGEX.match(line).groups()
...
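Tying those two snippets together, here is a minimal sketch of a wrapper around axel. download_with_axel is a name of my own, and the exact progress format axel prints varies between versions, so you still need to supply your own regex where indicated.

import subprocess

def download_with_axel(url, filename, n_connections=4):
    """Run axel on one URL and yield its raw output lines (hypothetical helper)."""
    cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o', filename, url]
    process = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    for line in iter(process.stdout.readline, b''):
        # Match the line against your own progress regex here if you want a percentage
        yield line.rstrip()
    process.wait()
    if process.returncode != 0:
        raise RuntimeError("axel failed for {0}".format(url))

Note that each file still downloads one at a time here; axel's gain comes from splitting a single file across several connections.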
You are blocking while you wait for each download to finish, so the total time is the sum of the round-trip times of all the downloads. Your code is probably spending most of its time waiting on network traffic. One way to improve this is not to block while waiting for each response. You can do this in several ways: for example, by handing each request off to a separate thread (or process), or by using an event loop with coroutines. Read up on the threading and asyncio modules.
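To make the threaded approach concrete, here is a minimal sketch using concurrent.futures (standard library in Python 3, available for Python 2 via the futures backport). fetch_one and fetch_all are names I made up, and it assumes the same GIST_DICT_LINKS dict and target directory as your code.

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

def fetch_one(name, url, dest_dir="dicts/included_dicts/wordlists"):
    # Download a single gist and write it to <dest_dir>/<name>.txt
    path = os.path.join(dest_dir, "{0}.txt".format(name))
    with open(path, "wb") as out:
        out.write(urlopen(url).read())
    return name + ".txt"

def fetch_all(links, max_workers=8):
    # Submit every URL to a thread pool and collect the finished filenames
    done = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, name, url): name
                   for name, url in links.items()}
        for future in as_completed(futures):
            done.append(future.result())  # re-raises any download error
    return done

# downloaded_wordlists = fetch_all(GIST_DICT_LINKS)

Because the work is I/O-bound, threads overlap the network waits even under the GIL; the same shape works for the rainbow-table dict.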