Pandas: find the cumulative sum of non-NaN records at each timestamp for each column
I have the following DataFrame:
timestamp col_A col_B col_C
0 2016-02-15 00:00:00 2.0 NaN NaN
1 2016-02-15 00:01:00 1.0 NaN NaN
2 2016-02-15 00:02:00 4.0 2.0 NaN
3 2016-02-15 00:03:00 2.0 2.0 NaN
4 2016-02-15 00:04:00 7.0 4.1 1.0
5 2016-02-15 00:05:00 2.0 5.0 2.0
6 2016-02-15 00:06:00 2.4 2.0 7.5
7 2016-02-15 00:07:00 2.0 6.3 1.2
8 2016-02-15 00:08:00 2.5 7.0 NaN
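I want to find the cumulative sum (that is, a running count) of non-NaN records at each timestamp for each column. The expected output DataFrame should be: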
timestamp col_A col_B col_C
0 2016-02-15 00:00:00 1 NaN NaN
1 2016-02-15 00:01:00 2 NaN NaN
2 2016-02-15 00:02:00 3 1 NaN
3 2016-02-15 00:03:00 4 2 NaN
4 2016-02-15 00:04:00 5 3 1
5 2016-02-15 00:05:00 6 4 2
6 2016-02-15 00:06:00 7 5 3
7 2016-02-15 00:07:00 8 6 4
8 2016-02-15 00:08:00 9 7 NaN
Right now I iterate over the DataFrame and compute the cumsum record by record. Is there a more elegant way to do this? Thanks!
Use notnull + cumsum. Note that np.nan is a float, so any column that still contains NaN has its integer counts upcast to float.
df.iloc[:,1:]=df.iloc[:,1:].notnull().cumsum()[df.iloc[:,1:].notnull()]
df
Out[33]:
timestamp col_A col_B col_C
0 2016-02-15 00:00:00 1 NaN NaN
1 2016-02-15 00:01:00 2 NaN NaN
2 2016-02-15 00:02:00 3 1.0 NaN
3 2016-02-15 00:03:00 4 2.0 NaN
4 2016-02-15 00:04:00 5 3.0 1.0
5 2016-02-15 00:05:00 6 4.0 2.0
6 2016-02-15 00:06:00 7 5.0 3.0
7 2016-02-15 00:07:00 8 6.0 4.0
8 2016-02-15 00:08:00 9 7.0 NaN
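If the float upcasting is a problem, one possible workaround is pandas' nullable integer dtype. This is only a sketch, assuming a pandas version that supports "Int64"; the small frame and the names mask, counts and counts_int are made up for illustration and just mimic the NaN pattern of col_B and col_C from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same NaN pattern as col_B and col_C above.
df = pd.DataFrame({
    "col_B": [np.nan, np.nan, 2.0, 2.0, 4.1],
    "col_C": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

mask = df.notna()                     # True where a value is present
counts = mask.cumsum().where(mask)    # running count, NaN where the input was NaN
counts_int = counts.astype("Int64")   # nullable integers: missing entries become <NA>

print(counts_int)
```

The counts stay integers and the gaps show up as <NA> instead of NaN.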
Inline, with where
df.assign(**(lambda d: d.cumsum().where(d))(df.drop(columns='timestamp').notna()))
timestamp col_A col_B col_C
0 2016-02-15 00:00:00 1 NaN NaN
1 2016-02-15 00:01:00 2 NaN NaN
2 2016-02-15 00:02:00 3 1.0 NaN
3 2016-02-15 00:03:00 4 2.0 NaN
4 2016-02-15 00:04:00 5 3.0 1.0
5 2016-02-15 00:05:00 6 4.0 2.0
6 2016-02-15 00:06:00 7 5.0 3.0
7 2016-02-15 00:07:00 8 6.0 4.0
8 2016-02-15 00:08:00 9 7.0 NaN
In place, with update
df.update((lambda d: d.cumsum().where(d))(df.drop(columns='timestamp').notna()))
df
timestamp col_A col_B col_C
0 2016-02-15 00:00:00 1 NaN NaN
1 2016-02-15 00:01:00 2 NaN NaN
2 2016-02-15 00:02:00 3 1.0 NaN
3 2016-02-15 00:03:00 4 2.0 NaN
4 2016-02-15 00:04:00 5 3.0 1.0
5 2016-02-15 00:05:00 6 4.0 2.0
6 2016-02-15 00:06:00 7 5.0 3.0
7 2016-02-15 00:07:00 8 6.0 4.0
8 2016-02-15 00:08:00 9 7.0 NaN
Details
d = df.drop(columns='timestamp').notna()
d.cumsum().where(d)
col_A col_B col_C
0 1 NaN NaN
1 2 NaN NaN
2 3 1.0 NaN
3 4 2.0 NaN
4 5 3.0 1.0
5 6 4.0 2.0
6 7 5.0 3.0
7 8 6.0 4.0
8 9 7.0 NaN
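For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the frame from the question by hand (the pd.date_range call and the variable name result are my own choices, not part of the original post) and applies the same notna / cumsum / where steps.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Rebuild the example frame from the question: one row per minute.
df = pd.DataFrame({
    "timestamp": pd.date_range("2016-02-15", periods=9, freq="min"),
    "col_A": [2.0, 1.0, 4.0, 2.0, 7.0, 2.0, 2.4, 2.0, 2.5],
    "col_B": [nan, nan, 2.0, 2.0, 4.1, 5.0, 2.0, 6.3, 7.0],
    "col_C": [nan, nan, nan, nan, 1.0, 2.0, 7.5, 1.2, nan],
})

# Boolean mask of the value columns: True where a record is present.
d = df.drop(columns="timestamp").notna()

# Running count of present records, blanked out again where the input was NaN.
# assign returns a new frame, so df itself is left untouched.
result = df.assign(**d.cumsum().where(d))

print(result)
```

Because assign returns a copy, the original values in df are still available afterwards, which is the main difference from the update variant.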