NumPy: shuffle a multidimensional array by row only, keeping column order unchanged
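How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?
I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?
Example: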
import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ...  # ??? shuffle by row only, not columns ???
print(Y)
What I have now is the original matrix:
[[ 0.48252164 0.12013048]
[ 0.77254355 0.74382174]
[ 0.45174186 0.8782033 ]
[ 0.75623083 0.71763107]
[ 0.26809253 0.75144034]
[ 0.23442518 0.39031414]]
The output should have the rows shuffled, not the columns, for example:
[[ 0.45174186 0.8782033 ]
[ 0.48252164 0.12013048]
[ 0.77254355 0.74382174]
[ 0.75623083 0.71763107]
[ 0.23442518 0.39031414]
[ 0.26809253 0.75144034]]
You can use numpy.random.shuffle().
This function only shuffles the array along the first axis of a
multi-dimensional array. The order of sub-arrays is changed but their
contents remains the same.
In [2]: import numpy as np
In [3]:
In [3]: X = np.random.random((6, 2))
In [4]: X
Out[4]:
array([[0.71935047, 0.25796155],
[0.4621708 , 0.55140423],
[0.22605866, 0.61581771],
[0.47264172, 0.79307633],
[0.22701656, 0.11927993],
[0.20117207, 0.2754544 ]])
In [5]: np.random.shuffle(X)
In [6]: X
Out[6]:
array([[0.71935047, 0.25796155],
[0.47264172, 0.79307633],
[0.4621708 , 0.55140423],
[0.22701656, 0.11927993],
[0.20117207, 0.2754544 ],
[0.22605866, 0.61581771]])
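As a minimal sketch of the same idea with the newer Generator interface (np.random.default_rng is an assumption here; the session above uses the legacy np.random.shuffle), the following checks that the shuffle happens in place and that every row keeps its contents. The seed and the small integer array are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

# Row i is (2*i, 2*i + 1), so an intact row always satisfies row[1] - row[0] == 1.
X = np.arange(12).reshape(6, 2)
original = X.copy()

rng.shuffle(X)  # shuffles along axis 0, in place; returns None

rows_intact = bool(np.all(X[:, 1] - X[:, 0] == 1))
same_rows = np.array_equal(np.sort(X, axis=0), original)
print(X)
print(rows_intact, same_rows)
```

Because the array is modified in place, no second copy of the data is needed, which matters for very large matrices.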
For other functionality, you can also look at the following function:
The function random.Generator.permuted was introduced in NumPy 1.20.0.
The new function differs from shuffle and permutation in that the
subarrays indexed by an axis are permuted rather than the axis being
treated as a separate 1-D array for every combination of the other
indexes. For example, it is now possible to permute the rows or
columns of a 2-D array.
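One caveat worth illustrating: Generator.permuted shuffles each 1-D slice along the chosen axis independently, so it does not keep whole rows together. The sketch below (assuming NumPy 1.20 or later; the seed and array are illustrative) contrasts it with Generator.permutation, which moves entire rows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = np.arange(12).reshape(6, 2)  # row i is (2*i, 2*i + 1)

# permuted: with axis=0 each column is shuffled on its own,
# so values from different rows may end up side by side.
per_column = rng.permuted(X, axis=0)

# permutation: returns a copy in which whole rows are reordered.
# (Generator.shuffle does the same thing in place.)
whole_rows = rng.permutation(X, axis=0)

print(per_column)
print(whole_rows)
```

For the question as asked (rows shuffled, columns untouched), permutation or shuffle is the closer match; permuted is useful when each row or column should be scrambled separately.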
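You can also use np.random.permutation to generate a random permutation of row indices and then index into the rows of X using np.take with axis=0. In addition, np.take accepts an out= option, which lets us overwrite the input array X itself and so save memory. (The NumPy documentation notes that out is always buffered in the default mode='raise', so a temporary copy may still be allocated internally.) Thus, the implementation would look like this: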
np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
Sample run:
In [23]: X
Out[23]:
array([[ 0.60511059, 0.75001599],
[ 0.30968339, 0.09162172],
[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.0957233 , 0.96210485],
[ 0.56843186, 0.36654023]])
In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);
In [25]: X
Out[25]:
array([[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.30968339, 0.09162172],
[ 0.56843186, 0.36654023],
[ 0.0957233 , 0.96210485],
[ 0.60511059, 0.75001599]])
Extra performance boost
Here is a trick that uses np.argsort() to speed up np.random.permutation(X.shape[0]):
np.random.rand(X.shape[0]).argsort()
Speedup results:
In [32]: X = np.random.random((6000, 2000))
In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop
In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop
Thus, the shuffling solution can be modified to:
np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
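To see why the argsort trick yields a usable row order, here is a small sketch: drawing one random key per row and sorting the keys gives each index exactly once. The size n = 10 is arbitrary and only for illustration.

```python
import numpy as np

n = 10
keys = np.random.rand(n)   # one random sort key per row
perm = keys.argsort()      # indices that would sort the keys

# A valid permutation contains every index 0..n-1 exactly once.
is_permutation = np.array_equal(np.sort(perm), np.arange(n))
print(perm, is_permutation)
```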
Runtime tests:
These tests include the two approaches listed in this post and the approach based on np.random.shuffle.
In [40]: X = np.random.random((6000, 2000))
In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop
In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop
In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop
So it seems that these np.take-based solutions are worth using only if memory is a concern; otherwise the np.random.shuffle-based solution looks like the way to go.
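Outside IPython, the same comparison can be sketched with the standard timeit module. The array here is deliberately smaller than the 6000 x 2000 one used above so that it runs quickly; the helper function names are made up for this example, and absolute numbers will vary by machine and NumPy version.

```python
import timeit

import numpy as np

X = np.random.random((600, 200))  # smaller than the array timed above

def shuffle_in_place():
    np.random.shuffle(X)

def take_with_permutation():
    np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)

def take_with_argsort():
    np.take(X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X)

timings = {}
for func in (shuffle_in_place, take_with_permutation, take_with_argsort):
    # Best of 3 repeats, 10 calls each, reported as seconds per call.
    timings[func.__name__] = min(timeit.repeat(func, number=10, repeat=3)) / 10

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```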
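After some experimentation, I found the most memory- and time-efficient way to shuffle data (row-wise) in an nD array. First shuffle the indices of the array, then use the shuffled indices to fetch the data. For example: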
rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
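One point to keep in mind with this approach: indexing with an integer array returns a new array rather than a view, so both the original and the shuffled data exist in memory until the old array is released. The sketch below (with a smaller, illustrative array) checks this with np.shares_memory.

```python
import numpy as np

data = np.random.randint(5, size=(600, 200))
perm = np.arange(data.shape[0])
np.random.shuffle(perm)        # shuffle only the small index array

shuffled = data[perm]          # advanced indexing returns a new array
shares = np.shares_memory(shuffled, data)
print(shares)
```

Rebinding the original name to the result, as in the snippet above, lets the old array be freed afterwards, but the peak memory during the indexing step presumably still covers both arrays.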
More details
Here I use memory_profiler to measure memory usage and Python's built-in time module to record the time, comparing all the previous answers.
import time

import numpy as np
from memory_profiler import profile  # third-party package


@profile
def main():
    # Shuffle the data itself, in place
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format(time.time() - start))

    # Shuffle the row indices, then gather the rows in that order
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format(time.time() - start))

    # Use np.take() with random sort keys, writing back into the input
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format(time.time() - start))


if __name__ == '__main__':
    main()
Timing results
Time for direct shuffle: 0.03345608711242676 # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676 # 67.2msec
Memory profiler results
Line #    Mem usage     Increment    Line Contents
==================================================
    39    117.422 MiB    0.000 MiB   @profile
    40                               def main():
    41                                   # shuffle data itself
    42    208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43    208.977 MiB    0.000 MiB       start = time.time()
    44    208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45    208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                   # Shuffle index and get data from shuffled index
    48    300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49    300.531 MiB    0.000 MiB       start = time.time()
    50    300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51    300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52    300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53    300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                   # using np.take()
    56    392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57    392.094 MiB    0.000 MiB       start = time.time()
    58    392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59    392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
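You can use the np.vectorize() function to shuffle the elements within each row of a two-dimensional array A (each row is permuted independently; the order of the rows themselves is unchanged):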
shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')
A_shuffled = shuffle(A)
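A small check of what this construction does, as a sketch: with signature='(n)->(n)', np.vectorize calls the wrapped function once per 1-D row, so values are shuffled inside each row and never move between rows. The helper permute_row and the seeded generator are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def permute_row(row):
    # Return a shuffled copy of one 1-D row.
    return rng.permutation(row)

shuffle_within_rows = np.vectorize(permute_row, signature='(n)->(n)')

A = np.arange(12).reshape(4, 3)  # row i holds 3*i, 3*i + 1, 3*i + 2
A_shuffled = shuffle_within_rows(A)
print(A_shuffled)
```

Note that this is different from reordering whole rows, which is what the question asks for; np.vectorize also loops in Python, so it is unlikely to be fast on very large arrays.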
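I have a question about this (or maybe this is the answer).
Suppose we have a NumPy array X with shape=(1000,60,11,1).
Also suppose X is an array of images of size 60x11 with 1 channel (60x11x1).
What if I want to shuffle the order of all these images? For that, I would shuffle the indices of X.
def shuffling(X):
    indx = np.arange(len(X))  # array of indices along the first axis of X
    np.random.shuffle(indx)   # shuffle the indices in place
    X = X[indx]               # gather the images in the shuffled order (creates a copy)
    return X
Will this work? As far as I can tell, len(X) returns the size of the first dimension (here 1000), which is what is needed.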
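A quick way to check this is sketched below on a hypothetical batch in which image i is filled with the value i. len(X) equals X.shape[0], so the index array covers the images, and each image should come out unchanged but in a new position.

```python
import numpy as np

def shuffling(X):
    indx = np.arange(len(X))   # len(X) == X.shape[0], the number of images
    np.random.shuffle(indx)
    return X[indx]

# Hypothetical batch: image i is filled with the value i.
X = np.empty((1000, 60, 11, 1))
X[...] = np.arange(1000).reshape(1000, 1, 1, 1)

Y = shuffling(X)

# Every image is still constant, so no pixels were mixed between images.
images_intact = bool(np.all(Y == Y[:, :1, :1, :1]))
order = Y[:, 0, 0, 0].astype(int)  # which original image sits at each position
print(len(X), Y.shape, images_intact)
```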
I tried many solutions, and in the end I used this simple one:
import numpy as np
from sklearn.utils import shuffle
x = np.array([[1, 2],
[3, 4],
[5, 6]])
print(shuffle(x, random_state=0))
Output:
[[5 6]
 [3 4]
 [1 2]]
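If you have a 3-D array, loop over the first axis (axis=0) and apply this function to each 2-D sub-array, like:
np.array([shuffle(item) for item in array_3d])  # array_3d is your 3-D NumPy array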