Python stdout during multiprocessing
I'm scraping a website and I want to print a counter to show progress. I had this working during serial processing. (It's a two-step scrape.)
from multiprocessing import Pool
from sys import stdout
from bs4 import BeautifulSoup

searched_counter = 0
processed_counter = 0

def run_scrape(var_input):
    global searched_counter, processed_counter
    # get search results
    parsed = ...  # parse using bs4 (elided)
    searched_counter += 1
    stdout.write("\rTotal Searched/Processed: %d/%d" % (searched_counter, processed_counter))
    stdout.flush()
    if parsed:  # only go to next page if result is what I want
        # get the page I want using parsed data
        # parse some more and write out to file
        processed_counter += 1
        stdout.write("\rTotal Searched/Processed: %d/%d" % (searched_counter, processed_counter))
        stdout.flush()

list_to_scrape = ["data%05d" % (x,) for x in range(1, 10000)]
pool = Pool(8)
pool.map(run_scrape, list_to_scrape)
stdout.write('\n')
When I run it with multiprocessing, the output gets jumbled and it prints a bunch of seemingly random numbers that don't add up to what it actually writes out to file...
Split the list into groups of a certain size n (perhaps a multiple of the pool size), then iterate over that super-list, creating a new pool for each group. You can do the counting as you iterate over the super-list, as in the sketch below.
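A minimal sketch of that idea (untested; chunk_size and do_scrape are hypothetical names, and the actual parsing is elided just as in the question). All the counting happens in the parent process, so no shared state is needed:

from multiprocessing import Pool
from sys import stdout

def do_scrape(item):
    # hypothetical stand-in for the two-step scrape; returns True if
    # the item produced a result worth processing
    parsed = ...  # parse using bs4 (elided)
    return bool(parsed)

if __name__ == "__main__":
    items = ["data%05d" % (x,) for x in range(1, 10000)]
    chunk_size = 8 * 4  # e.g. a multiple of the pool size
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    searched = processed = 0
    for chunk in chunks:
        pool = Pool(8)  # a fresh pool per group, as suggested above
        results = pool.map(do_scrape, chunk)
        pool.close()
        pool.join()
        # count in the parent, between groups
        searched += len(chunk)
        processed += sum(1 for r in results if r)
        stdout.write("\rTotal Searched/Processed: %d/%d" % (searched, processed))
        stdout.flush()
    stdout.write('\n')

The trade-off is that the progress display only updates once per group rather than once per item.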
Normal Python variables can't be shared between processes, so each worker process in the pool ends up with its own copy of searched_counter and processed_counter, and incrementing them in one process has no effect on the others. The multiprocessing library has a few ways to share state between processes, but the easiest one for your use-case is to use a multiprocessing.Value:
from multiprocessing import Pool, Value
from sys import stdout

def init(s, p):
    global searched_counter, processed_counter
    searched_counter = s
    processed_counter = p

def run_scrape(var_input):
    global searched_counter, processed_counter
    # get search results
    parsed = ...  # parse using bs4 (elided)
    with searched_counter.get_lock():
        searched_counter.value += 1
        stdout.write("\rTotal Searched/Processed: %d/%d" %
                     (searched_counter.value, processed_counter.value))
        stdout.flush()
    if parsed:
        with processed_counter.get_lock():
            processed_counter.value += 1
            stdout.write("\rTotal Searched/Processed: %d/%d" %
                         (searched_counter.value, processed_counter.value))
            stdout.flush()

if __name__ == "__main__":
    searched_counter = Value('i', 0)
    processed_counter = Value('i', 0)
    list_to_scrape = ["data%05d" % (x,) for x in range(1, 10000)]
    pool = Pool(8, initializer=init, initargs=(searched_counter, processed_counter))
    pool.map(run_scrape, list_to_scrape)
    stdout.write('\n')
Note that I'm using the initializer/initargs keyword arguments to explicitly pass the counters from the parent process to the children, which is a best practice and helps ensure Windows compatibility.
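As an aside: since the only shared state here is a progress counter, another option is to have each worker return whether it processed its item and do the counting in the parent with Pool.imap_unordered, which avoids shared state entirely. A rough sketch, with the parsing elided as in the question:

from multiprocessing import Pool
from sys import stdout

def run_scrape(var_input):
    parsed = ...  # parse using bs4 (elided)
    # ...fetch and process the next page here if parsed...
    return bool(parsed)

if __name__ == "__main__":
    list_to_scrape = ["data%05d" % (x,) for x in range(1, 10000)]
    searched = processed = 0
    pool = Pool(8)
    # imap_unordered yields each worker's result as soon as it finishes
    for was_processed in pool.imap_unordered(run_scrape, list_to_scrape):
        searched += 1
        processed += was_processed  # a bool counts as 0 or 1
        stdout.write("\rTotal Searched/Processed: %d/%d" % (searched, processed))
        stdout.flush()
    pool.close()
    pool.join()
    stdout.write('\n')

Because only the parent writes to stdout, the progress line can't get interleaved by multiple processes.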