Why do I have "OutOfMemoryError" in my KMeans CuPy code?
I'm really new to GPU coding. I found this KMeans CuPy code, and my assignment is to use a large dataset of shape (n, 3), for example, to demonstrate the time difference between GPU and CPU. I want to use a large number of clusters, but I get a memory-management error. Could anyone point me in a direction to research and fix it? I've looked into it, but I don't have a clear starting point yet.
import contextlib
import time

import cupy
import matplotlib.pyplot as plt
import numpy


@contextlib.contextmanager
def timer(message):
    cupy.cuda.Stream.null.synchronize()
    start = time.time()
    yield
    cupy.cuda.Stream.null.synchronize()
    end = time.time()
    print('%s: %f sec' % (message, end - start))
var_kernel = cupy.ElementwiseKernel(
    'T x0, T x1, T c0, T c1', 'T out',
    'out = (x0 - c0) * (x0 - c0) + (x1 - c1) * (x1 - c1)',
    'var_kernel'
)

sum_kernel = cupy.ReductionKernel(
    'T x, S mask', 'T out',
    'mask ? x : 0',
    'a + b', 'out = a', '0',
    'sum_kernel'
)

count_kernel = cupy.ReductionKernel(
    'T mask', 'float32 out',
    'mask ? 1.0 : 0.0',
    'a + b', 'out = a', '0.0',
    'count_kernel'
)
def fit_xp(X, n_clusters, max_iter):
    assert X.ndim == 2

    # Get NumPy or CuPy module from the supplied array.
    xp = cupy.get_array_module(X)
    n_samples = len(X)

    # Make an array to store the labels indicating which cluster each sample
    # is contained in.
    pred = xp.zeros(n_samples)

    # Choose the initial centroid for each cluster.
    initial_indexes = xp.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]

    for _ in range(max_iter):
        # Compute the new label for each sample.
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_pred = xp.argmin(distances, axis=1)

        # If no label changed for any sample, we assume the algorithm has
        # converged and exit the loop.
        if xp.all(new_pred == pred):
            break
        pred = new_pred

        # Compute the new centroid for each cluster.
        i = xp.arange(n_clusters)
        mask = pred == i[:, None]
        sums = xp.where(mask[:, :, None], X, 0).sum(axis=1)
        counts = xp.count_nonzero(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts

    return centers, pred
def fit_custom(X, n_clusters, max_iter):
    assert X.ndim == 2

    n_samples = len(X)
    pred = cupy.zeros(n_samples, dtype='float32')
    initial_indexes = cupy.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]

    for _ in range(max_iter):
        distances = var_kernel(X[:, None, 0], X[:, None, 1],
                               centers[None, :, 1], centers[None, :, 0])
        new_pred = cupy.argmin(distances, axis=1)
        if cupy.all(new_pred == pred):
            break
        pred = new_pred

        i = cupy.arange(n_clusters)
        mask = pred == i[:, None]
        sums = sum_kernel(X, mask[:, :, None], axis=1)
        counts = count_kernel(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts

    return centers, pred
def draw(X, n_clusters, centers, pred, output):
    # Plot the samples and centroids of the fitted clusters into an image file.
    for i in range(n_clusters):
        labels = X[pred == i]
        plt.scatter(labels[:, 0], labels[:, 1], c=numpy.random.rand(3))
    plt.scatter(
        centers[:, 0], centers[:, 1], s=120, marker='s', facecolors='y',
        edgecolors='k')
    plt.savefig(output)
def run_cpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):  ##, output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]

    with timer(' CPU '):
        centers, pred = fit_xp(X_train, n_clusters, max_iter)


def run_gpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):  ##, output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]

    with cupy.cuda.Device(gpuid):
        X_train = cupy.asarray(X_train)
        with timer(' GPU '):
            if use_custom_kernel:
                centers, pred = fit_custom(X_train, n_clusters, max_iter)
            else:
                centers, pred = fit_xp(X_train, n_clusters, max_iter)
By the way, I'm working in Colab Pro with 25 GB of RAM. The code works with n_clusters=200 and num=1000000, but if I use larger numbers the error appears. I'm running the code like this:
run_gpu(0,200,1000000,10,True)
This is the error that I get:
Any suggestions are welcome. Thank you for your time.
Assuming CuPy is smart enough not to create explicit copies of the broadcasted inputs to var_kernel, the output distances has to be of size 2 * num * n_clusters, which is exactly the 6,400,000,000 bytes it is trying to allocate. You could reduce the memory footprint by never actually writing the distances out to memory at all, which means fusing var_kernel with argmin. See this section of the documentation.
If I understand the example there correctly, this should work:
@cupy.fuse(kernel_name='argmin_distance')
def argmin_distance(x1, y1, x2, y2):
    return cupy.argmin((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2), axis=1)
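
If the fused version does work, the call site in fit_custom would shrink to something like the following. This is my sketch of the usage, not taken from the docs, and it assumes the fused function broadcasts exactly like the unfused code did; it also uses the unswapped centers coordinates discussed further down:

    # Hypothetical replacement for the var_kernel + argmin pair inside the
    # loop; the fused kernel never materializes the full distances array.
    new_pred = argmin_distance(X[:, None, 0], X[:, None, 1],
                               centers[None, :, 0], centers[None, :, 1])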
The next question is where the other 13.7 GB come from. A big part of them could just be the instances of distances from earlier iterations. I'm no CuPy expert, but at least in Python/NumPy, the way you use distances inside the loop would not reuse the same memory; it would allocate more memory on every call of var_kernel. The same problem applies to pred, which is allocated before the loop. If CuPy does things the NumPy way, the solution would be to put [:] in there, e.g.
pred[:] = new_pred
or
distances[:, :] = var_kernel(X[:, None, 0], X[:, None, 1],
                             centers[None, :, 1], centers[None, :, 0])
For this you would also need to allocate distances before the loop. Note that it isn't needed anymore once you use kernel fusion, so it serves only as an example here. It's probably best to allocate everything beforehand and then use this syntax everywhere inside the loop.
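
As a sketch of that pattern (my assumptions: CuPy's ElementwiseKernel accepts a preallocated output array as its last positional argument, and cupy.argmin supports out=; both match the documented APIs, but I haven't benchmarked this):

    # Sketch only: preallocate the buffers once, then write into them
    # in-place on every iteration instead of allocating fresh arrays.
    distances = cupy.empty((n_samples, n_clusters), dtype=X.dtype)
    new_pred = cupy.empty(n_samples, dtype=cupy.int64)
    pred = cupy.zeros(n_samples, dtype=cupy.int64)
    for _ in range(max_iter):
        var_kernel(X[:, None, 0], X[:, None, 1],
                   centers[None, :, 0], centers[None, :, 1],
                   distances)                      # output buffer reused
        cupy.argmin(distances, axis=1, out=new_pred)
        if cupy.all(new_pred == pred):
            break
        pred[:] = new_pred                         # in-place label update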
I don't know enough about CuPy to say why fit_xp doesn't have the same problem (or does it?). My guess would be that garbage collection of CuPy objects works differently there. If garbage collection is "fast enough" in fit_custom, it should work even without kernel fusion or reuse of already-allocated arrays.
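
One detail worth knowing here (my addition, not part of the reasoning above): CuPy keeps freed allocations in a memory pool rather than returning them to the driver, so even after Python's garbage collector drops an old distances instance, the bytes can still be held by the pool. Dropping references explicitly and flushing the pool is a cheap way to test whether stale allocations are what pushes you over the limit:

    # Diagnostic sketch: release the reference right after use and hand
    # cached blocks back to the driver between iterations.
    pool = cupy.get_default_memory_pool()
    for _ in range(max_iter):
        distances = var_kernel(X[:, None, 0], X[:, None, 1],
                               centers[None, :, 1], centers[None, :, 0])
        new_pred = cupy.argmin(distances, axis=1)
        del distances            # allow the allocation to be reclaimed
        pool.free_all_blocks()   # return cached device memory to the driver
        ...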
Other problems, or at least oddities, with your code:
- Why are you comparing the zeroth coordinate of centers with the first coordinate of X? Wouldn't it make more sense to call

  distances = var_kernel(X[:, None, 0], X[:, None, 1],
                         centers[None, :, 0], centers[None, :, 1])
- Why are you creating 3D data when you only ever use its projection onto the 2D plane? So why not

  samples = numpy.random.randn(num, 2)
- Why are you using a floating-point dtype for (the initial version of) pred? argmin gives an integer-typed result. A minimal fix follows this list.
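
For that last point, the fix is just to initialize pred with the dtype argmin produces (int64 by default in CuPy and NumPy):

    # Integer labels, so the comparison with argmin's output never mixes
    # float32 with int64.
    pred = cupy.zeros(n_samples, dtype=cupy.int64)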