It takes longer to access shared memory than to load from file?
I have a very large file that I load in my main process. My goal is to have several processes read it from memory at the same time, both to avoid memory limits and to make it faster.
According to this answer, I should use Shared ctypes Objects:
Manager types are built for flexibility not efficiency ... this necessarily means copying whatever object is in question. .... If you want shared physical memory, I suggest using Shared ctypes Objects. These actually do point to a common location in memory, and therefore are much faster, and resource-light.
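(For reference, here is a minimal sketch of what I understand a shared ctypes object to be; this is my illustration, not code from the quoted answer. The value lives in a shared memory segment that the child process maps directly, so reading it does not pickle the payload.)

import multiprocessing

def worker(v):
    print(v.value)  # reads straight from the shared segment

if __name__ == '__main__':
    v = multiprocessing.Value('i', 42, lock=False)  # shared ctypes c_int
    # shared ctypes objects are passed to the child at creation time
    # (inheritance), not through Pool.map arguments
    p = multiprocessing.Process(target=worker, args=(v,))
    p.start()
    p.join()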
So I did this:
import time
import pickle
import multiprocessing
from functools import partial

def foo(_, v):
    tp = time.time()
    v = v.value
    print(hex(id(v)))
    print(f'took me {time.time()-tp} in process')

if __name__ == '__main__':
    # creates a file which is about 800 MB
    with open('foo.pkl', 'wb') as file:
        pickle.dump('aaabbbaa'*int(1e8), file, protocol=pickle.HIGHEST_PROTOCOL)

    t1 = time.time()
    with open('foo.pkl', 'rb') as file:
        contract_conversion = pickle.load(file)
    print(f'load took {time.time()-t1}')

    m = multiprocessing.Manager()
    vm = m.Value(str, contract_conversion, lock=False)  # not locked because I only read from it, so it's safe
    foo_p = partial(foo, v=vm)

    tpo = time.time()
    with multiprocessing.Pool() as pool:
        pool.map(foo_p, range(4))
    print(f'took me {time.time()-tpo} for pool stuff')
But I can see that the processes each use a copy (the RAM usage of each process is very high), and it is much slower than simply reading from disk.
The output:
load took 0.8662333488464355
0x1c736ca0040
took me 2.286606550216675 in process
0x15cc0404040
took me 3.178203582763672 in process
0x1f30f049040
took me 4.179721355438232 in process
0x21d2c8cc040
took me 4.913192510604858 in process
took me 5.251579999923706 for pool stuff
The ids are not identical either, but I am not sure whether id is just a Python identifier or the memory location.
You are not using shared memory. That would be multiprocessing.Value, not multiprocessing.Manager().Value. You are storing the string in the manager's server process and sending pickles over a TLS connection to access the value. On top of that, the server process is limited by its own GIL while it handles requests.

I don't know how much each of those aspects contributes to the overhead, but taken together it is more expensive than reading shared memory.
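For illustration, here is a minimal sketch of one way to actually share the payload (not from the original answer): with multiprocessing.shared_memory, available since Python 3.8, the parent writes the data once and every worker attaches to the same block by name, so nothing is pickled per process. The function name foo mirrors the question's code; shm_name and size are parameters I introduce for the sketch.

import time
import multiprocessing
from functools import partial
from multiprocessing import shared_memory

def foo(_, shm_name, size):
    tp = time.time()
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, no copy
    view = shm.buf[:size]   # memoryview into the shared block
    first_byte = view[0]    # touch the data without copying it
    view.release()          # release the view before closing the handle
    shm.close()
    print(f'took me {time.time()-tp} in process')

if __name__ == '__main__':
    data = b'aaabbbaa' * int(1e8)   # the ~800 MB payload from the question
    n = len(data)
    shm = shared_memory.SharedMemory(create=True, size=n)
    shm.buf[:n] = data              # one-time copy into shared memory
    del data                        # drop the private copy

    foo_p = partial(foo, shm_name=shm.name, size=n)
    tpo = time.time()
    with multiprocessing.Pool() as pool:
        pool.map(foo_p, range(4))
    print(f'took me {time.time()-tpo} for pool stuff')

    shm.close()
    shm.unlink()                    # free the block once everyone is done

Note that bytes(view) or decoding to str would copy again; the zero-copy benefit only holds while the workers operate on the memoryview itself.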