Groupby shift (lagged values) analogue with only Numpy (no pandas)

Question

I have a data frame that looks like this:

          id    date       v1
0          0  1983.0    1.574
1          0  1984.0    1.806
2          0  1985.0    4.724
3          1  1986.0    0.320
4          1  1987.0    3.414
     ...     ...      ...
107191  9874  1993.0   52.448
107192  9874  1994.0  108.652
107193  9875  1992.0    1.597
107194  9875  1993.0    3.134
107195  9875  1994.0    7.619

I want to generate a new column holding the lagged values of v1, grouped by id. In pandas I would use

df.groupby('id')['v1'].shift(-1)

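However, I want to translate this into pure matrix/array form using only Numpy. What is the most straightforward way to get the analogue in Numpy? I need to avoid the pandas tools because I want to use Numba @jit later.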

Answer 1

IIUC, you want to implement df.groupby('id')['v1'].shift(-1) purely in numpy. This breaks down into two parts: a grouper and a shift method.

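For a 2D array whose first column is the grouping column and whose second column holds the values, the numpy equivalent of groupby() is shown below. It assumes the rows are already sorted by id, since the split points are the first occurrences of each unique id: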

np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
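To make the grouping step concrete, here is a minimal sketch on a small hypothetical array (the values are made up for illustration). It shows what `np.unique(..., return_index=True)` returns and why the first index is dropped before calling `np.split`.

```python
import numpy as np

# Hypothetical toy data: column 0 is the id (already sorted), column 1 is the value
arr = np.array([[0, 10.0],
                [0, 11.0],
                [1, 20.0],
                [2, 30.0],
                [2, 31.0]])

# Index of the first row of each unique id
first_idx = np.unique(arr[:, 0], return_index=True)[1]

# Drop the leading 0 so that np.split does not emit an empty first chunk
groups = np.split(arr[:, 1], first_idx[1:])

print(first_idx)                    # start positions of the groups
print([g.tolist() for g in groups]) # one value array per id
```

Each element of `groups` is a 1D array with the values of one id, in the original row order.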

The numpy equivalent of shift(-1) for a 1D array is:

np.append(np.roll(arr,-1)[:-1], np.nan)
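As a small illustration of the shift step, the sketch below applies it to a hypothetical 1D array. It also shows that, for a shift of -1, plain slicing gives the same result as the `np.roll` form; the slicing variant is an alternative added here, not part of the answer.

```python
import numpy as np

# Hypothetical values of a single group
x = np.array([1.0, 2.0, 3.0, 4.0])

# np.roll wraps the first element around to the end ...
rolled = np.roll(x, -1)

# ... so the wrapped element is dropped and replaced by NaN
shifted = np.append(rolled[:-1], np.nan)

# Equivalent form using slicing only
shifted_alt = np.append(x[1:], np.nan)

print(shifted)
```

Note that appending `np.nan` produces a float array, so integer input is upcast to float.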

Putting the two together gives you what you want:

#2D array with only id and v1 as columns
arr = df[['id','v1']].values   

#Groupby based on id
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:]) 

#apply shift to grouped elements
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper] 

#stack them as a single array
new_col = np.hstack(shift) 

#set as column
df['shifted'] = new_col

Testing with dummy data:

import numpy as np
import pandas as pd

#Dummy data
idx = [0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3]
val = np.arange(len(idx))
arr = np.array([idx, val]).T
df = pd.DataFrame(arr, columns=['id','v1'])

#apply grouped shifting
arr = df[['id','v1']].values
grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
new_col = np.hstack(shift)
df['shifted'] = new_col

print(df)

    id  v1  shifted
0    0   0      1.0
1    0   1      2.0
2    0   2      3.0
3    0   3      4.0
4    0   4      NaN
5    1   5      6.0
6    1   6      7.0
7    1   7      8.0
8    1   8      9.0
9    1   9     10.0
10   1  10      NaN
11   2  11     12.0
12   2  12     13.0
13   2  13     14.0
14   2  14      NaN
15   3  15     16.0
16   3  16     17.0
17   3  17     18.0
18   3  18     19.0
19   3  19      NaN
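Since the question mentions using Numba later, a variant without the Python list comprehension may also be useful. The following is only a sketch under stated assumptions: `group_shift_next` is a hypothetical helper name, the rows are assumed to be arranged so that each id forms one contiguous block, and whether it compiles unchanged under Numba's @jit has not been verified here. Instead of splitting, it compares each id with the next one and keeps the next value only when both rows belong to the same group.

```python
import numpy as np

def group_shift_next(ids, values):
    """Hypothetical helper: next value within each id group, NaN at group ends.

    Assumes each id occupies one contiguous block of rows.
    """
    ids = np.asarray(ids)
    values = np.asarray(values, dtype=np.float64)
    out = np.full(values.shape[0], np.nan)
    if values.shape[0] > 1:
        # True where row i and row i+1 belong to the same group
        same_group = ids[:-1] == ids[1:]
        # Take the next row's value inside a group, NaN at a group boundary
        out[:-1] = np.where(same_group, values[1:], np.nan)
    return out

# Hypothetical example data
ids = np.array([0, 0, 0, 1, 1, 2])
v1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
result = group_shift_next(ids, v1)
print(result)
```

The last row of every group, including the final row of the array, stays NaN, which matches the behaviour of the split-and-shift approach above.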
