Fill in rows with former row values, based on conditions, groupby and certain rows only
I have this dataset:
df = pd.DataFrame({'user': {0: 848, 1: 848, 2: 848, 3: 848, 4: 848, 5: 848, 6: 848, 7: 848, 8: 848, 9: 848, 10: 848, 11: 848, 12: 848, 13: 848}, \
'date': {0: '2005-02-05', 1: '2006-10-25', 2: '2006-11-07', 3: '2006-11-20', 4: '2006-12-04', 5: '2006-12-21', 6: '2007-01-08', 7: '2007-02-08', 8: '2007-03-08', 9: '2007-04-10', 10: '2007-11-28', 11: '2007-12-10', 12: '2009-01-07', 13: '2009-01-12'},\
'need_data': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 1, 11: 0, 12: 1, 13: 0}, \
'vt': {0: 34.0, 1: 49.25, 2: 49.25, 3: 0.0, 4: 49.4, 5: 0.0, 6: 0.0, 7: 49.8, 8: 0.0, 9: 50.1, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0}, \
})
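I need a new column ['feed1'] with the following condition:
If column need_data equals 1 (and consequently column [vt] is 0), then the value for ['feed1'] should be taken from the previous entry of the same user whose column [vt] is different from 0.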
The expected output is as follows:
df = pd.DataFrame( {'user': {0: 848, 1: 848, 2: 848, 3: 848, 4: 848, 5: 848, 6: 848, 7: 848, 8: 848, 9: 848, 10: 848, 11: 848, 12: 848, 13: 848}, 'date': {0: '2005-02-05', 1: '2006-10-25', 2: '2006-11-07', 3: '2006-11-20', 4: '2006-12-04', 5: '2006-12-21', 6: '2007-01-08', 7: '2007-02-08', 8: '2007-03-08', 9: '2007-04-10', 10: '2007-11-28', 11: '2007-12-10', 12: '2009-01-07', 13: '2009-01-12'}, 'need_data': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 1, 11: 0, 12: 1, 13: 0}, 'vt': {0: 34.0, 1: 49.25, 2: 49.25, 3: 0.0, 4: 49.4, 5: 0.0, 6: 0.0, 7: 49.8, 8: 0.0, 9: 50.1, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0}, 'feed2': {0: '2005-02-05', 1: '2006-10-25', 2: '2006-11-07', 3: '2006-11-07', 4: '2006-12-04', 5: '2006-12-21', 6: '2007-01-08', 7: '2007-02-08', 8: '2007-03-08', 9: '2007-04-10', 10: '2007-04-10', 11: '2007-12-10', 12: '2007-04-10', 13: '2009-01-12'}} )
Here is an illustration showing the desired output:
There are several similar questions, but none of them is exactly the same.
If need_data == 1, replace date with NaN, then use GroupBy.ffill:
df['feed2'] = df['date'].mask(df['need_data'].eq(1)).groupby(df['user']).ffill()
print (df)
user date vt c1 c2 c3 need_data feed1 feed2
0 1 1995-09-01 2 1 3 5 0 1995-09-01 1995-09-01
1 1 1995-09-02 0 0 0 0 1 1995-09-01 1995-09-01
2 1 1995-10-03 0 0 0 0 1 1995-09-01 1995-09-01
3 2 1995-10-04 6 2 2 5 0 1995-10-04 1995-10-04
4 2 1995-10-05 1 3 5 6 0 1995-10-05 1995-10-05
5 2 1995-11-07 1 9 3 4 0 1995-11-07 1995-11-07
6 2 1995-11-08 0 0 0 0 1 1995-11-07 1995-11-07
7 3 1995-11-09 0 2 4 4 0 1995-11-09 1995-11-09
8 3 1995-11-10 0 0 0 0 1 1995-11-09 1995-11-09
9 3 1995-11-15 0 5 6 6 0 1995-12-15 1995-11-15
10 3 1995-12-18 0 5 2 3 0 1995-12-18 1995-12-18
11 4 1995-12-19 0 6 7 4 0 1995-12-19 1995-12-19
12 4 1995-12-20 0 4 0 3 0 1995-12-20 1995-12-20
13 4 1995-12-23 0 0 0 0 1 1995-12-20 1995-12-20
14 4 1995-12-26 0 6 8 2 0 1995-12-26 1995-12-26
15 4 1995-12-27 0 0 0 0 1 1995-12-26 1995-12-26
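The two steps above (mask, then forward fill per group) can be seen in isolation on a small frame. This is only a minimal sketch: the frame `toy` and the intermediate name `masked` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy data, not the dataset from the question.
toy = pd.DataFrame({
    'user': [1, 1, 2, 2],
    'date': ['1995-09-01', '1995-09-02', '1995-10-03', '1995-10-04'],
    'need_data': [0, 1, 1, 0],
})

# Step 1: hide the dates of rows flagged with need_data == 1 (they become NaN).
masked = toy['date'].mask(toy['need_data'].eq(1))

# Step 2: forward fill inside each user group, so a value is never
# carried over from one user to the next.
toy['feed2'] = masked.groupby(toy['user']).ffill()

print(toy)
```

Note that the first row of user 2 stays missing: it is flagged, but that user has no earlier row to take a date from.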
Solution without the need_data column; here each row is tested for whether the string '0' is present in all of the columns:
m = df[['vt','c1','c2','c3']].eq('0').all(axis=1)
df['feed2'] = df['date'].mask(m).groupby(df['user']).ffill()
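To make the row-wise test concrete, here is a small sketch on made-up data. The frame `toy` and its values are assumptions for illustration; the values are stored as strings to match the `.eq('0')` comparison above. If the columns were numeric, the comparison would presumably need to be `.eq(0)` instead.

```python
import pandas as pd

# Hypothetical toy data with string values, so that .eq('0') matches.
toy = pd.DataFrame({
    'user': [1, 1, 2, 2],
    'date': ['1995-09-01', '1995-09-02', '1995-10-03', '1995-10-04'],
    'vt': ['2', '0', '0', '0'],
    'c1': ['1', '0', '6', '0'],
})

# True only for rows where every listed column equals '0'.
m = toy[['vt', 'c1']].eq('0').all(axis=1)

# Mask the dates of those rows and forward fill per user.
toy['feed2'] = toy['date'].mask(m).groupby(toy['user']).ffill()

print(m.tolist())
print(toy)
```

The third row has vt equal to '0' but c1 equal to '6', so it is not treated as an all-zero row and keeps its own date.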
EDIT: You need to test two conditions chained by | (bitwise OR), then use Series.where to set the original values back in rows where m1 is False:
m1 = df['need_data'].eq(1)
m2 = df['vt'].eq(0)
df['feed2'] = df['date'].mask(m1 | m2).groupby(df['user']).ffill().where(m1, df['date'])
print (df)
user date need_data vt feed2
0 848 2005-02-05 0 34.00 2005-02-05
1 848 2006-10-25 0 49.25 2006-10-25
2 848 2006-11-07 0 49.25 2006-11-07
3 848 2006-11-20 1 0.00 2006-11-07
4 848 2006-12-04 0 49.40 2006-12-04
5 848 2006-12-21 0 0.00 2006-12-21
6 848 2007-01-08 0 0.00 2007-01-08
7 848 2007-02-08 0 49.80 2007-02-08
8 848 2007-03-08 0 0.00 2007-03-08
9 848 2007-04-10 0 50.10 2007-04-10
10 848 2007-11-28 1 0.00 2007-04-10
11 848 2007-12-10 0 0.00 2007-12-10
12 848 2009-01-07 1 0.00 2007-04-10
13 848 2009-01-12 0 0.00 2009-01-12
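As a self-contained check, the snippet below rebuilds the question's data (written as plain lists instead of nested dicts) and compares the result with the expected 'feed2' column from the question. The names `filled` and `expected` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'user': [848] * 14,
    'date': ['2005-02-05', '2006-10-25', '2006-11-07', '2006-11-20',
             '2006-12-04', '2006-12-21', '2007-01-08', '2007-02-08',
             '2007-03-08', '2007-04-10', '2007-11-28', '2007-12-10',
             '2009-01-07', '2009-01-12'],
    'need_data': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    'vt': [34.0, 49.25, 49.25, 0.0, 49.4, 0.0, 0.0, 49.8,
           0.0, 50.1, 0.0, 0.0, 0.0, 0.0],
})

m1 = df['need_data'].eq(1)  # rows that must receive an earlier date
m2 = df['vt'].eq(0)         # rows that must not be used as a source

# Hide both kinds of rows, then forward fill per user: every row now
# carries the date of the last earlier row with vt != 0 and need_data != 1.
filled = df['date'].mask(m1 | m2).groupby(df['user']).ffill()

# Keep the filled value only where need_data == 1; elsewhere restore the date.
df['feed2'] = filled.where(m1, df['date'])

# Expected 'feed2' column copied from the question.
expected = ['2005-02-05', '2006-10-25', '2006-11-07', '2006-11-07',
            '2006-12-04', '2006-12-21', '2007-01-08', '2007-02-08',
            '2007-03-08', '2007-04-10', '2007-04-10', '2007-12-10',
            '2007-04-10', '2009-01-12']
assert df['feed2'].tolist() == expected
```

Masking with m2 as well is what makes row 12 skip row 11 (vt is 0 there) and fall back to 2007-04-10.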