Python List Append Slow?
I have to merge two text files into one and build a new list from them. The first file contains URLs and the other contains URL paths/folders that have to be applied to every URL. I'm working with lists, and it's really slow because there are about 200,000 items.
Sample:
urls.txt:
http://wwww.google.com
....
paths.txt:
/abc
/bce
....
Later, after the loop is done, there should be a new list:
http://wwww.google.com/abc
http://wwww.google.com/bce
Python code:
import re

URLS_TO_CHECK = []  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = open("done.txt", "r").read().splitlines()  # old done urls
    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK.append(url)  # <<< slow!
I've already read some other posts about the map function and about disabling gc, but I can't use map in my program, and disabling gc didn't really help.
URLS_TO_CHECK = set(re.findall(r"(http://.+?....)", open("urls.txt", "r").read()))
for url in URLS_TO_CHECK:
    for path in paths:
        check_url(url + path)
This would probably be much faster... I think it's essentially the same thing...
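A more self-contained sketch of this idea might look like the following; check_url is a placeholder and the loading of paths.txt and done.txt is an assumption, since neither appears in the snippet above:

import re

def check_url(url):
    # placeholder: whatever check the rest of the program performs
    print(url)

with open("paths.txt") as f:
    paths = f.read().splitlines()

with open("done.txt") as f:
    done = set(f.read().splitlines())

# one pass over urls.txt, deduplicating base URLs with a set
URLS_TO_CHECK = set(re.findall(r"(http://.+?....)", open("urls.txt", "r").read()))

for url in URLS_TO_CHECK:
    for path in paths:
        full_url = url + path
        if full_url not in done:  # set membership is O(1)
            check_url(full_url)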
Searching a dict is much faster than searching a list: Python: List vs Dict for look up table
import re

URLS_TO_CHECK = {}  # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = dict([(l, True) for l in open("done.txt", "r").read().splitlines()])  # old done urls
    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i])  # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK[url] = True  # result in URLS_TO_CHECK.keys()
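To see why the lookup matters, here is a rough micro-benchmark sketch (the data is made up and the numbers will vary by machine) comparing membership tests on a list versus a set of 200,000 items:

import timeit

items = ["http://example-%d.com/path" % i for i in range(200000)]
as_list = list(items)
as_set = set(items)
needle = "http://example-199999.com/path"  # worst case for the list

# list membership scans the elements one by one: O(n)
print(timeit.timeit(lambda: needle in as_list, number=100))
# set/dict membership is a hash lookup: O(1) on average
print(timeit.timeit(lambda: needle in as_set, number=100))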
This approach takes advantage of things like:
- fast lookups in a set: O(1) instead of O(n)
- generating values on demand instead of building the whole list at once
- reading the file in chunks instead of loading all the data at once
- avoiding unnecessary regular expressions
def yield_urls():
    with open("paths.txt") as f:
        paths = [line.strip() for line in f]  # needed and iterated over in each pass, may stay a list
    with open("done.txt") as f:
        done_urls = set(line.strip() for line in f)  # looked up in each pass; set is O(1) vs O(n) for a list
    # resources are cleaned up after the with blocks
    with open("urls.txt", "r") as f:
        for url in f:  # iterate over the file directly instead of building a big list of indices first, much quicker
            for subpath in paths:
                full_url = ''.join((url.strip()[7:], subpath))  # [7:] drops the leading "http://"; no regex means faster, string formatting may beat join, you need to check
                # strip() takes care of the trailing newlines in strings read from the file
                if full_url not in done_urls:  # fast lookup in the set
                    yield full_url  # yield instead of appending

# usage
for url in yield_urls():
    pass  # do something with url
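As a usage sketch (check_url is a stand-in and the exact handling of done.txt is an assumption, neither is defined in this answer), the generator could be consumed like this, recording finished URLs as you go:

def check_url(url):
    # placeholder for the real check
    print(url)

with open("done.txt", "a") as done_file:
    for url in yield_urls():
        check_url(url)
        done_file.write(url + "\n")  # remember the URL so a later run skips it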