How to fill data gaps only when extremities have the same value, and limited to a maximum of occurrences?
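I have searched here a lot for an answer to this problem but could not find one. The desired result is to fill a gap only when the values at both of its extremities are equal, filling at most 4 values per gap:
My dataset: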
0 NaN
1 NaN
2 NaN
3 5.0
4 5.0
5 NaN
6 NaN
7 5.0
8 6.0
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 5.0
16 5.0
17 NaN
18 NaN
19 6.0
20 6.0
21 NaN
22 NaN
23 NaN
24 NaN
25 5.0
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
32 NaN
33 5.0
34 NaN
35 NaN
Desired result (gaps are filled only when the extremity values are equal, with at most 4 values filled per gap):
0 NaN # Not filled since the gap ends with 5 but this is the dataset beginning (don't know how it starts)
1 NaN # Not filled since the gap ends with 5 but this is the dataset beginning (don't know how it starts)
2 NaN # Not filled since the gap ends with 5 but this is the dataset beginning (don't know how it starts)
3 5.0 # Original dataset
4 5.0 # Original dataset
5 5.0 # Filled since the gap starts with 5 and ends with 5 (and is smaller than 4 values)
6 5.0 # Filled since the gap starts with 5 and ends with 5 (and is smaller than 4 values)
7 5.0 # Original dataset
8 6.0 # Original dataset
9 NaN # Not filled since the gap starts with 6 and ends with 5
10 NaN .
11 NaN .
12 NaN .
13 NaN .
14 NaN # Not filled since the gap starts with 6 and ends with 5
15 5.0 # Original dataset
16 5.0 # Original dataset
17 NaN # Not filled since the gap starts with 5 and ends with 6
18 NaN # Not filled since the gap starts with 5 and ends with 6
19 6.0 # Original dataset
20 6.0 # Original dataset
21 NaN # Not filled since the gap starts with 6 and ends with 5
22 NaN .
23 NaN .
24 NaN # Not filled since the gap starts with 6 and ends with 5
25 5.0 # Original dataset
26 5.0 # Filled since the gap starts with 5 and ends with 5
27 5.0 # Filled since the gap starts with 5 and ends with 5
28 5.0 # Filled since the gap starts with 5 and ends with 5
29 5.0 # Filled since the gap starts with 5 and ends with 5
30 NaN # Not filled since maximum gap is 4
31 NaN # Not filled since maximum gap is 4
32 NaN # Not filled since maximum gap is 4
33 5.0 # Original dataset
34 NaN # Not filled since the gap starts with 5 but this is the dataset end (don't know how it ends)
35 NaN # Not filled since the gap starts with 5 but this is the dataset end (don't know how it ends)
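It should be something like this:
import math

def is_missing(x):
    # Treat both None and float NaN as missing values
    return x is None or (isinstance(x, float) and math.isnan(x))

def extremities(arr):
    nones = [i for i, x in enumerate(arr) if is_missing(x)]
    not_nones = [i for i, x in enumerate(arr) if not is_missing(x)]
    for i in nones:
        try:
            start = [x for x in not_nones if x < i][-1]   # last known index before i
            finish = [x for x in not_nones if x > i][0]   # first known index after i
        except IndexError:
            # The gap touches the beginning or end of the data: leave it unfilled
            continue
        # Fill only when both extremities match, at most 4 positions past the start
        if arr[start] == arr[finish] and i - start < 5:
            arr[i] = arr[start]
    return arr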
Edit:
Sorry, I forgot that the fill is limited to 4 values. I have edited the code.
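The loop above rescans the index lists for every missing position. As a rough sketch of the same rule in a single pass, assuming missing values may be either None or float NaN, the hypothetical helper `fill_matching_gaps` below walks each run of missing values once and fills at most `limit` positions when the values on both sides match. The function name, the `limit` parameter, and the short sample list are illustrative assumptions, not part of the answer above.

```python
import math


def fill_matching_gaps(values, limit=4):
    """Return a copy of `values` with qualifying gaps filled (hypothetical helper)."""

    def missing(x):
        # Assumption: a value is missing if it is None or a float NaN
        return x is None or (isinstance(x, float) and math.isnan(x))

    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if not missing(out[i]):
            i += 1
            continue
        # [i, j) is a maximal run of missing values
        j = i
        while j < n and missing(out[j]):
            j += 1
        # Fill only interior gaps whose neighbours on both sides are equal,
        # and never more than `limit` positions from the left edge
        if i > 0 and j < n and out[i - 1] == out[j]:
            for k in range(i, min(j, i + limit)):
                out[k] = out[i - 1]
        i = j
    return out


nan = float("nan")
# Shortened sample: a leading gap, a short 5..5 gap, a 6..5 gap,
# a 5..5 gap longer than the limit, and a trailing gap
data = [nan, 5.0, nan, nan, 5.0, 6.0, nan, 5.0, nan, nan, nan, nan, nan, 5.0, nan]
result = fill_matching_gaps(data)
```

Unlike the loop above, this version returns a new list instead of modifying its argument, which keeps the original data available for comparison.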
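We can use boolean masking and cumsum to identify the blocks of NaN values that start and end with the same value, then group the column on these blocks and forward fill with a limit of 4:
s = df['col']
m = s.notna()  # True where a value is present
# Mask known values that differ from the next known value, forward fill up to
# 4 NaNs within each block, then restore the masked values from the original
s.mask(s[m] != s[m].shift(-1)).groupby(m.cumsum()).ffill(limit=4).fillna(s)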
0 NaN
1 NaN
2 NaN
3 5.0
4 5.0
5 5.0
6 5.0
7 5.0
8 6.0
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 5.0
16 5.0
17 NaN
18 NaN
19 6.0
20 6.0
21 NaN
22 NaN
23 NaN
24 NaN
25 5.0
26 5.0
27 5.0
28 5.0
29 5.0
30 NaN
31 NaN
32 NaN
33 5.0
34 NaN
35 NaN
Name: col, dtype: float64
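To make the chain above easier to follow, here is a self-contained sketch that rebuilds the question's data as a Series and splits the steps apart. The intermediate names (`known`, `differs`, `masked`, `filled`) are my own, and the explicit `reindex` is an added step to make the index alignment visible; it is not in the original one-liner.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The 36 values from the question, written run by run
s = pd.Series(
    [nan] * 3 + [5, 5] + [nan] * 2 + [5, 6] + [nan] * 6 + [5, 5]
    + [nan] * 2 + [6, 6] + [nan] * 4 + [5] + [nan] * 7 + [5] + [nan] * 2,
    name="col",
)

m = s.notna()   # True where a value is present
known = s[m]    # only the known values, original index kept

# True where a known value differs from the next known value
# (the last known value is compared with NaN, so it is always True)
differs = known != known.shift(-1)

# Hide those values so they cannot be forward filled into the gap after them
masked = s.mask(differs.reindex(s.index, fill_value=False))

# Each known value starts a new group that also holds the NaNs following it;
# forward fill at most 4 NaNs per group, then restore the hidden values
filled = masked.groupby(m.cumsum()).ffill(limit=4).fillna(s)
```

The leading NaNs end up in group 0, which contains no known value, so nothing is filled there. The trailing NaNs follow the last known value, which is always hidden before the fill, so they also stay empty.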