pandas replace only part of a column
Here is my input:
import pandas as pd
import numpy as np
list1 = [10, 79, 6, 38, 4, 557, 12, 220, 46, 22, 45, 22]
list2 = [4, 3, 23, 6, 234, 47, 312, 2, 426, 42, 435, 23]
df = pd.DataFrame({'A': list1, 'B': list2}, columns=['A', 'B'])
df['C'] = np.where(df['A'] > df['B'].shift(-2), 1, np.nan)
print(df)
which produces this output:
      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
What I need to do is change column 'C' into non-overlapping groups of three consecutive 1s. The desired output is:
      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
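So rows 2, 3 and 6 change from NaN to 1.0. Row 7 already has a 1.0, so it is ignored. Rows 8 and 9 need to stay NaN because row 7 is the last entry of the previous group.
I don't know whether there is a better way to build column 'C' so that this happens at creation time.
I tried several versions of fillna and ffill; none worked for me.
It seems convoluted, but I tried to isolate the row IDs of each 1.0 with this line: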
print (df.loc[df['C'] == 1])
which correctly outputs:
     A   B    C
1   79   3  1.0
5  557  47  1.0
7  220   2  1.0
Although I have this information, I don't know how to proceed from there.
Thanks in advance for your help,
David
import pandas as pd
import numpy as np
list1 = [10, 79, 6, 38, 4, 557, 12, 220, 46, 22, 45, 22]
list2 = [4, 3, 23, 6, 234, 47, 312, 2, 426, 42, 435, 23]
df = pd.DataFrame({'A': list1, 'B': list2}, columns=['A', 'B'])
df['C'] = np.where(df['A'] > df['B'].shift(-2), 1, np.nan)
      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
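Create an array from the series:
a = np.array(df.C)
This function tests segments of the array against a list of patterns and replaces each matching segment with the corresponding fill pattern. Segments that have already been matched are not considered for later matches, because the fill values are greater than 1.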
def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)
    return a
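As a side note on the first comment in the function: NaN never compares equal to itself, so a segment containing NaN cannot match a pattern containing NaN with a plain np.array_equal call. The small sketch below illustrates this with made-up values (seg and pattern are illustrative names only). It also shows the equal_nan keyword, which I believe is available from NumPy 1.19 onward, as a possible alternative to swapping NaN for zero.

```python
import numpy as np

seg = np.array([1.0, np.nan, np.nan])
pattern = [1.0, np.nan, np.nan]

# NaN != NaN, so the element-wise comparison fails
print(np.array_equal(seg, pattern))

# Option 1: ask array_equal to treat NaNs as equal (NumPy 1.19+)
print(np.array_equal(seg, pattern, equal_nan=True))

# Option 2: what fill_segments does, i.e. compare after mapping NaN to 0
print(np.array_equal(np.nan_to_num(seg), [1.0, 0.0, 0.0]))
```

The first call reports no match and the other two report a match, which is why the function works on zeros internally and converts them back to NaN at the end.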
The patterns to replace:
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]
and the corresponding fill patterns:
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]
Build the test_patterns and fill_patterns inputs:
patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]
Run the function:
a = fill_segments(a, patterns, fills)
Assign a to column C:
df.C = a
df:
      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
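The patterns and fills may need to be adjusted or extended, depending on how the input column is initially populated and on the specific rules for the resulting sequence.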
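If maintaining a pattern table becomes awkward, one possible alternative is a single left-to-right scan. This is only a sketch under my reading of the rule, namely that every marker not already covered by a previous block starts a new block. The names fill_blocks and width are hypothetical and do not come from the question.

```python
import numpy as np

def fill_blocks(values, width=3):
    """Return a new array in which each uncovered marker starts a block of `width` ones."""
    values = np.asarray(values, dtype=float)
    marked = ~np.isnan(values)          # True where the input has a marker
    out = np.full(len(values), np.nan)  # start from an all-NaN result
    i = 0
    while i < len(out):
        if marked[i]:
            out[i:i + width] = 1.0      # slicing clips at the end of the array
            i += width                  # skip rows this block already covers
        else:
            i += 1
    return out

# column C from the question, written out by hand
c = np.array([np.nan, 1, np.nan, np.nan, np.nan, 1,
              np.nan, 1, np.nan, np.nan, np.nan, np.nan])
filled = fill_blocks(c)
print(filled)
```

For the sample column this fills rows 1 to 3 and 5 to 7 and leaves the rest NaN. The result could then be assigned back with df['C'] = filled.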
Edit:
A faster version (thanks to b2002):
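ii = df[pd.notnull(df.C)].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
    # writes through the underlying NumPy array; assumes a default RangeIndex,
    # and may fail if .values returns a read-only array (e.g. with copy-on-write)
    df.C.values[ci:ci+3] = 1.0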
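First find the starting indices by looking at the differences between the non-null positions in column C (the first index is included by default), then iterate over those indices and use loc to change slices of column C: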
ii = df[pd.notnull(df.C)].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
    df.loc[ci:ci+2,'C'] = 1.0
Result:
      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN
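To make the intermediate steps easier to follow, here is a sketch that prints what each variable holds for the sample data. It uses integer positions and iloc instead of index labels, which is my own substitution so that it does not depend on a default RangeIndex. The names pos, gaps and starts are illustrative, and the code assumes column C contains at least one marker.

```python
import numpy as np
import pandas as pd

list1 = [10, 79, 6, 38, 4, 557, 12, 220, 46, 22, 45, 22]
list2 = [4, 3, 23, 6, 234, 47, 312, 2, 426, 42, 435, 23]
df = pd.DataFrame({'A': list1, 'B': list2})
df['C'] = np.where(df['A'] > df['B'].shift(-2), 1, np.nan)

# integer positions of the non-null entries in C
pos = np.flatnonzero(df['C'].notna().to_numpy())
# distance from each marker to the previous one
gaps = np.diff(pos)
# the first marker always starts a block; later markers start one only
# if they sit more than 2 rows after the previous marker
starts = [int(pos[0])] + [int(pos[i]) for i in range(1, len(pos)) if gaps[i - 1] > 2]

print(pos.tolist())   # [1, 5, 7]
print(gaps.tolist())  # [4, 2]
print(starts)         # [1, 5]

col = df.columns.get_loc('C')
for s in starts:
    df.iloc[s:s + 3, col] = 1.0  # positional slice, end-exclusive
print(df)
```

Note that the gap test compares each marker with the previous marker, not with the start of the previous block. For markers spaced exactly two rows apart (say at rows 1, 3 and 5), only row 1 would be treated as a start, so it is worth checking whether that matches the rule you actually want.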