Forward fill column on condition
My dataframe looks like this:
df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
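Wherever Col1 contains the value 1, I want to forward fill Col2 with 1 n times, counting the row that holds the 1. For example, if n = 4 then I need the result to look like this.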
df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
,'Col2':[0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})
I think I could do this with a for loop and a counter that resets every time the condition occurs, but is there a faster way to produce the same result?
Thanks!
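For reference, the counter-based loop described in the question might look like the minimal sketch below. The function name `fill_after_trigger` is hypothetical, and the sketch assumes, as in the expected output, that n counts the triggering row itself.

```python
import pandas as pd

def fill_after_trigger(col, n):
    """Return a list with 1 on each row where col == 1 and on the n - 1 rows after it."""
    out = []
    remaining = 0  # rows still to be filled from the most recent trigger
    for value in col:
        if value == 1:
            remaining = n  # reset the counter on every trigger
        if remaining > 0:
            out.append(1)
            remaining -= 1
        else:
            out.append(0)
    return out

df = pd.DataFrame({'Col1': [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]})
df['Col2'] = fill_after_trigger(df['Col1'], n=4)
```

This is the slow baseline that the vectorized answers aim to replace; it is mainly useful for checking their output on small inputs.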
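For a general solution, replace the non-1 values with missing values using Series.where, forward fill the 1 values with the limit parameter, and finally replace the remaining missing values with the original values. Note that limit counts only the rows after each 1, so n = 3 below yields four consecutive 1s, which matches the n = 4 example in the question: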
n = 3
df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
print(df)
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
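The three steps above can be wrapped in a small helper that takes the question's n directly. This is only a sketch: the name `ffill_ones` is hypothetical, and it assumes n >= 2 because `ffill` rejects a limit of 0.

```python
import pandas as pd

def ffill_ones(col, n):
    """Mark each 1 in col and the n - 1 rows after it (assumes n >= 2)."""
    ones_only = col.where(col.eq(1))        # keep the 1s, turn everything else into NaN
    filled = ones_only.ffill(limit=n - 1)   # propagate each 1 over at most n - 1 following rows
    return filled.fillna(col).astype(int)   # restore the original values where nothing was filled

df = pd.DataFrame({'Col1': [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]})
df['Col2'] = ffill_ones(df['Col1'], n=4)
```

Because the last step falls back to the original column, values other than 0 and 1 outside the filled windows are preserved rather than zeroed.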
Here's a NumPy-based approach, using np.flatnonzero to obtain the positions where Col1 is 1 and taking the broadcast sum with an arange up to n:
n = 4
ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
df.loc[ix, 'Col2'] = 1
print(df)
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
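To make the broadcast sum concrete, the sketch below prints nothing but builds the intermediate arrays on a shorter, made-up column. The bounds filter and the use of `iloc` are my additions, not part of the answer: they assume you want to ignore positions past the last row (which occur when a 1 sits within the final n - 1 rows) and to work by position rather than by index label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [0, 1, 0, 0, 0, 1, 0], 'Col2': 0})
n = 4

starts = np.flatnonzero(df['Col1'].to_numpy() == 1)  # positions of the 1s: [1, 5]
offsets = np.arange(n)[:, None]                       # column vector [[0], [1], [2], [3]]
ix = (starts + offsets).ravel('F')                    # column-major flatten: [1, 2, 3, 4, 5, 6, 7, 8]
ix = ix[ix < len(df)]                                 # drop positions beyond the last row

df.iloc[ix, df.columns.get_loc('Col2')] = 1           # positional assignment into Col2
```

Flattening in 'F' order keeps the positions that belong to one trigger next to each other, although for a plain assignment the order does not matter.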
Approach #1: NumPy-based 1D convolution -
N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
A more compact one-liner -
df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')
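As a small worked example of why the convolution works: convolving with a kernel of N ones gives, at each position, the sum of the current value and the N - 1 values before it, so the result is positive exactly when a 1 occurred in that window. The array below is made up for illustration, and slicing with `[:len(col)]` is a variation I am assuming is acceptable; it is equivalent to `[:-N+1]` for N >= 2 and also handles N = 1.

```python
import numpy as np

col = np.array([0, 1, 0, 0, 0, 0, 1, 0])
N = 3  # window size, counting the row that holds the 1

full = np.convolve(col, np.ones(N, dtype=int))  # 'full' mode, length len(col) + N - 1
trimmed = full[:len(col)]                        # keep the part aligned with col
result = (trimmed > 0).astype('i1')              # any 1 within the last N rows -> 1
```

Here `full` is `[0, 1, 1, 1, 0, 0, 1, 1, 1, 0]`, and trimming the trailing N - 1 entries leaves `[0, 1, 1, 1, 0, 0, 1, 1]`.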
Approach #2: Using SciPy's binary_dilation -
from scipy.ndimage import binary_dilation
N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
Approach #3: Squeeze out the best from NumPy with a strided-view based tool -
from skimage.util.shape import view_as_windows
N = 4 # window size
mask = df.Col1.values==1
w = view_as_windows(mask,N)
idx = len(df)-(N-mask[-N:].argmax())
if mask[-N:].any():
    mask[idx:idx+N-1] = 1
w[mask[:-N+1]] = 1
df['Col2'] = mask.view('i1')
Benchmarking
Setup with the given sample scaled up by 10,000x -
In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
...: ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
...:
...: df = pd.concat([df]*10000)
...: df.index = range(len(df.index))
Timings
# @jezrael's soln
In [68]: %%timeit
...: n = 3
...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# App-1 from this post
In [72]: %%timeit
...: N = 4 # window size
...: K = np.ones(N,dtype=bool)
...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# App-2 from this post
In [70]: %%timeit
...: N = 4 # window size
...: K = np.ones(N,dtype=bool)
...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# App-3 from this post
In [35]: %%timeit
...: N = 4 # window size
...: mask = df.Col1.values==1
...: w = view_as_windows(mask,N)
...: idx = len(df)-(N-mask[-N:].argmax())
...: if mask[-N:].any():
...:     mask[idx:idx+N-1] = 1
...: w[mask[:-N+1]] = 1
...: df['Col2_4'] = mask.view('i1')
1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @yatu's soln
In [71]: %%timeit
...: n = 4
...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
...: df.loc[ix, 'Col2_5'] = 1
7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
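Before comparing timings it can help to confirm that the approaches agree on the scaled-up data. The sketch below does this for the ffill and convolution variants only, using a window of N = 4 for both (so limit = N - 1); the variable names are illustrative and it makes no claim about the timings themselves.

```python
import numpy as np
import pandas as pd

base = [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
df = pd.DataFrame({'Col1': base * 10000})  # same data as the 10,000x setup

N = 4  # window size, counting the row that holds the 1
via_ffill = (df['Col1'].where(df['Col1'].eq(1))
             .ffill(limit=N - 1).fillna(df['Col1']).astype(int))
via_conv = (np.convolve(df['Col1'].to_numpy(), np.ones(N, dtype=int))[:len(df)] > 0).astype(int)

same = np.array_equal(via_ffill.to_numpy(), via_conv)  # True when both columns match
```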
Using reindex:
N=4
s=df.loc[df.Col1==1,'Col1']
idx=s.index
s=s.reindex(idx.repeat(N))
s.index=(idx.values+np.arange(N)[:,None]).ravel('F')
df.Col2.update(s)
df
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1