Pandas dataframe randomly shuffle some column values in groups

Question

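I want to shuffle some column values, but only within a given group, and only for a certain percentage of the rows in that group. For example, for each group, I want to shuffle n% of the values in column b among each other.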

df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})

   grouper_col   b
0            1  12
1            1  13
2            2  16
3            3  21
4            3  14
5            3  11
6            3  12
7            4  13
8            4  15

Example output:

   grouper_col   b
0            1  13
1            1  12
2            2  16
3            3  21
4            3  11
5            3  14
6            3  12
7            4  15
8            4  13

I found

df.groupby("grouper_col")["b"].transform(np.random.permutation)

but with this I have no control over the percentage of values that get shuffled.

Thanks for any hints!

Answer 1

You can use numpy to create a function like this (it takes a numpy array as input):

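import numpy as np

def shuffle_portion(arr, percentage):
    # Work on a copy so the caller's array (which may be read-only) is left untouched
    arr = arr.copy()
    # Pick round(n * percentage / 100) distinct positions at random
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Write the picked values back into the same positions, in shuffled order
    arr[np.sort(shuf)] = arr[shuf]
    return arr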

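np.random.choice picks a set of indices of the size you need. The corresponding values in the given array can then be rearranged in the shuffled order. This should now shuffle 3 of the 9 values in column 'b':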

df['b'] = shuffle_portion(df['b'].values, 33)
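
As a quick sanity check of what the percentage means here, the following is a minimal sketch rather than part of the original answer. It assumes a hypothetical rng parameter (added only so the run is reproducible) and works on a copy of the input. It shows that at most round(n * percentage / 100) positions can change, and possibly fewer, because a picked value may land back in its original position.

```python
import numpy as np

def shuffle_portion(arr, percentage, rng=None):
    # rng is a hypothetical extra parameter, not in the original answer
    rng = np.random.default_rng() if rng is None else rng
    arr = np.array(arr)  # copy, so the input is not modified
    k = round(arr.shape[0] * percentage / 100)  # number of positions to pick
    picked = rng.choice(arr.shape[0], size=k, replace=False)
    # Picked values go back into the picked positions, in random order
    arr[np.sort(picked)] = arr[picked]
    return arr

original = np.array([12, 13, 16, 21, 14, 11, 12, 13, 15])
shuffled = shuffle_portion(original, 33, rng=np.random.default_rng(0))

# Number of positions whose value differs from the original
changed = int((original != shuffled).sum())
print(shuffled, changed)
```

Note that when only one position is picked (for example 50% of a two-row group), nothing can move, so small groups may come back unchanged.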

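Edit: to use it with apply, you need to convert the passed dataframe to an array inside the function (explained in the comments) and build the returned dataframe as well: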

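def shuffle_portion(_df, percentage=50):
    # Copy this group's column values into a numpy array
    arr = _df['b'].values.copy()
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Redistribute the picked values among the picked positions
    arr[np.sort(shuf)] = arr[shuf]
    # Return a modified copy of the group instead of mutating the input
    _df = _df.copy()
    _df['b'] = arr
    return _df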

Now you can do

df.groupby("grouper_col", as_index=False).apply(shuffle_portion)

It would be better to pass the name of the column to shuffle into the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...)
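
One way the column-name idea could look end to end is sketched below. This is an illustration under assumptions, not the answer's exact code: the function name shuffle_column_in_groups and the seed parameter are hypothetical, and it swaps groupby.apply for groupby(...)[col].transform, so that it does not depend on how apply treats the grouping column, which has changed between pandas versions.

```python
import numpy as np
import pandas as pd

def shuffle_column_in_groups(df, group_col, col, percentage=50, seed=None):
    # Hypothetical helper: shuffles `percentage` percent of `col` within each group
    rng = np.random.default_rng(seed)

    def _shuffle(s):
        arr = s.to_numpy().copy()
        k = round(len(arr) * percentage / 100)  # rows to pick in this group
        picked = rng.choice(len(arr), size=k, replace=False)
        arr[np.sort(picked)] = arr[picked]
        # Keep the group's index so values align with the original rows
        return pd.Series(arr, index=s.index)

    out = df.copy()  # leave the caller's dataframe untouched
    out[col] = df.groupby(group_col)[col].transform(_shuffle)
    return out

df = pd.DataFrame({'grouper_col': [1, 1, 2, 3, 3, 3, 3, 4, 4],
                   'b': [12, 13, 16, 21, 14, 11, 12, 13, 15]})
result = shuffle_column_in_groups(df, 'grouper_col', 'b', percentage=50, seed=0)
print(result)
```

With percentage=50, the two-row groups pick a single row each and therefore stay as they are; only the four-row group can actually change.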
