Urllib urlopen/urlretrieve too many open files error
Problem

I am trying to download >100,000 files from an FTP server in parallel (using threads). I previously tried it with urlretrieve, as answered here, however this gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (cannot find the reference anymore), so I tried to use urlopen in combination with shutil and then write to a file that I could close myself, as described here. This seemed to work fine, but then I ran into the same error again: URLError(OSError(24, 'Too many open files')). I thought the with statement would make the file close itself whenever writing is incomplete or fails, but the files seem to stay open and eventually bring the script to a halt.

Question

How can I prevent this error, i.e. make sure that every file gets closed?

Code
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls

if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter = '\t')
        ids = {row['some_id'] for row in reader}
    urls = build_urls(ids)
    p = Pool(100)
    print(p.map(download, urls))
You can try using contextlib to close your files:
import contextlib
[ ... ]
with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
A workaround is to increase the open file limit on your Linux OS. Check your current open file limit:

ulimit -Hn

Add the following line to your /etc/sysctl.conf file:

fs.file-max = <number>

where <number> is the new upper limit of open files that you want to set.

Save and close the file, then run

sysctl -p

so that the changes take effect.
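If changing the system-wide limit is not an option, the per-process soft limit can also be raised from inside the script, up to the hard limit. This is a minimal sketch using the standard resource module (Linux/macOS only; not part of the original answer):

import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# Raise the soft limit up to the hard limit; no special privileges needed.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))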
I think the file handles you create are not disposed of by the system in time, since closing a connection takes a while. So you quickly run out of free file handles, which include network sockets.

What you are doing is setting up a new FTP connection for every single file. This is bad practice. A better way is to open 5-15 connections and reuse them, downloading the files over the existing sockets without the overhead of the initial FTP handshake for every file. See this post for reference; a sketch of the idea follows below.
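A minimal sketch of that connection-reuse idea, assuming the standard ftplib module and a queue.Queue that hands a fixed number of pre-opened connections to the worker threads (the host name, login, and remote paths below are placeholders, not the real server):

import ftplib
import queue
from multiprocessing.dummy import Pool

N_CONNECTIONS = 10

# Open a handful of FTP connections up front and share them through a queue.
connections = queue.Queue()
for _ in range(N_CONNECTIONS):
    ftp = ftplib.FTP('some_ftp_server')   # placeholder host
    ftp.login()                           # anonymous login; adjust as needed
    connections.put(ftp)

def download(some_id):
    ftp = connections.get()               # borrow an existing connection
    try:
        with open('patric_genomes/' + some_id + '.fna', 'wb') as out_file:
            ftp.retrbinary('RETR ' + some_id + '/' + some_id + '.fna', out_file.write)
    except ftplib.all_errors as e:
        return None, e
    finally:
        connections.put(ftp)              # hand the connection back for reuse

with Pool(N_CONNECTIONS) as p:
    p.map(download, ids)                  # `ids` built as in the question

Each worker only ever uses a connection it has borrowed exclusively from the queue, so at most N_CONNECTIONS sockets plus N_CONNECTIONS output files are open at any time.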
P.S. Also, as @Tarun_Lalwani mentioned, creating a folder with more than ~1000 files in it is not a good idea, since it slows down the filesystem.
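For example, a hypothetical helper (not from any of the answers) that spreads the output over subdirectories keyed by the first characters of the id, so that no single folder ends up holding all >100,000 files:

from pathlib import Path

def sharded_filename(some_id, base_dir='patric_genomes'):
    # Use the first two characters of the id as a subdirectory name,
    # keeping each folder down to a small slice of the full set of files.
    shard = some_id[:2]
    target_dir = Path(base_dir) / shard
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / (some_id + '.fna')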
How can I prevent this error, i.e. make sure that every file gets closed?
To prevent the error you need to either increase the open file limit or, more reasonably, decrease the concurrency in your thread pool. Connections and files are already closed correctly by the context managers.

Your thread pool has 100 threads and opens at least 200 handles (one for the FTP connection and another for the file). A reasonable concurrency is about 10-30 threads.
Here is a simplified reproduction showing that the code is fine. Put some content into somefile in the current directory.
test.py
#!/usr/bin/env python3
import sys
import shutil
import logging

from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool


def download(id):
    ftp_url = sys.argv[1]
    filename = Path(__name__).parent / 'files'
    try:
        with urlopen(ftp_url) as src, (filename / id).open('wb') as dst:
            shutil.copyfileobj(src, dst)
    except Exception as e:
        logging.exception('Download error')


if __name__ == '__main__':
    with ThreadPool(10) as p:
        p.map(download, (str(i).zfill(4) for i in range(1000)))
Then, in the same directory:
$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
-v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass@$IP/dir/somefile
$ python3 test.py ftp://user:pass@$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test