pandas groupby and get all null rows till the first non null value in multiple columns
I am trying to use groupby to get all the null rows that come before the first non-null value.
For example, given the following dataframe:
+----+------------+-----------+--------+----------+--------+----------+
| ID | Start Date | End Date | Date_D | D-Values | Date_R | R-Values |
+----+------------+-----------+--------+----------+--------+----------+
| A | 2/26/2015 | 5/26/2015 | JAN_15 | - | 15-Jan | - |
| A | 2/26/2015 | 5/26/2015 | FEB_15 | - | 15-Feb | - |
| A | 2/26/2015 | 5/26/2015 | MAR_15 | - | 15-Mar | - |
| A | 2/26/2015 | 5/26/2015 | APR_15 | - | 15-Apr | - |
| A | 2/26/2015 | 5/26/2015 | MAY_15 | -28 | 15-May | 15000 |
| A | 2/26/2015 | 5/26/2015 | JUN_15 | - | 15-Jun | - |
| A | 2/26/2015 | 5/26/2015 | JUL_15 | - | 15-Jul | - |
| A | 2/26/2015 | 5/26/2015 | AUG_15 | - | 15-Aug | - |
+----+------------+-----------+--------+----------+--------+----------+
The output I want looks like this:
+----+------------+-----------+--------+----------+--------+----------+
| ID | Start Date | End Date | Date_D | D-Values | Date_R | R-Values |
+----+------------+-----------+--------+----------+--------+----------+
| A | 2/26/2015 | 5/26/2015 | FEB_15 | - | 15-Feb | - |
| A | 2/26/2015 | 5/26/2015 | MAR_15 | - | 15-Mar | - |
| A | 2/26/2015 | 5/26/2015 | APR_15 | - | 15-Apr | - |
| A | 2/26/2015 | 5/26/2015 | MAY_15 | -28 | 15-May | 15000 |
+----+------------+-----------+--------+----------+--------+----------+
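Edit
There are multiple IDs, so this needs to work across multiple customers.
I want the rows selected based on the start date and end date, e.g. start selecting from Feb_15 and stop at the last non-null value within the date range.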
Use DataFrame.isna() together with Series.cumprod() to flag, per ID, the rows up to and including the first non-null value:
df[df[['D-Values','R-Values']]
#.replace('-', np.nan) # if necessary
.isna()
.any(axis=1)
.groupby(df['ID'])
.cumprod()
.groupby(df['ID'])
.shift(fill_value=True)
.astype(bool)
& df['Date_D'].eq('FEB_15')
#.groupby(df['ID']) # BY ID
.cummax()
.eq(1)
]
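The chained expression above can be hard to follow, so here is a minimal sketch that splits it into named steps. The sample data (two IDs, "-" as the missing-value marker) and the variable names are made up for illustration, and the start marker 'FEB_15' is applied per ID here, which is an assumption about the intended behaviour.

```python
import pandas as pd

# Hypothetical sample data: two IDs, '-' marks a missing value as in the question.
df = pd.DataFrame({
    "ID": ["A"] * 6 + ["B"] * 6,
    "Date_D": ["JAN_15", "FEB_15", "MAR_15", "APR_15", "MAY_15", "JUN_15"] * 2,
    "D-Values": ["-", "-", "-", "-28", "-", "-", "-", "-", "-5", "-", "-", "-"],
    "R-Values": ["-", "-", "-", "15000", "-", "-", "-", "-", "700", "-", "-", "-"],
})

value_cols = ["D-Values", "R-Values"]

# A row counts as null if any value column is '-' or NaN.
is_null = (df[value_cols].eq("-") | df[value_cols].isna()).any(axis=1)

# Per ID: 1 while every row so far is null, 0 from the first non-null row on.
all_null_so_far = is_null.astype(int).groupby(df["ID"]).cumprod()

# Shift by one row within each ID so the first non-null row itself is kept.
up_to_first_value = (
    all_null_so_far.groupby(df["ID"]).shift(fill_value=1).astype(bool)
)

# Per ID: True from the 'FEB_15' row onwards.
from_start = (
    df["Date_D"].eq("FEB_15").astype(int).groupby(df["ID"]).cummax().astype(bool)
)

out = df[up_to_first_value & from_start]
print(out)
```

With this data, ID A keeps FEB_15 through APR_15 and ID B keeps FEB_15 through MAR_15. The casts to int before cumprod and cummax are only there to keep the dtypes predictable across pandas versions.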
You can use transform and idxmax:
idx = df[['D-Values','R-Values']].notna().all(axis=1).groupby(df["ID"]).transform('idxmax')
out = df[df.index <= idx]
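As a rough, self-contained illustration of the transform/idxmax idea with more than one ID, assuming missing values are real NaN rather than "-" and that the index is the default, increasing RangeIndex (the comparison on index labels relies on that). The data below is invented for the example.

```python
import pandas as pd

# Hypothetical data: NaN marks a missing value; default RangeIndex 0..6.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B"],
    "D-Values": [None, None, -28.0, None, None, 5.0, None],
    "R-Values": [None, None, 15000.0, None, None, 700.0, None],
})

# True where both value columns are filled in.
has_value = df[["D-Values", "R-Values"]].notna().all(axis=1)

# Index label of the first filled row per ID, broadcast to every row of that ID.
first_idx = has_value.astype(int).groupby(df["ID"]).transform("idxmax")

# Keep rows at or before the first filled row of their own ID.
out = df[df.index <= first_idx.to_numpy()]
print(out)
```

One caveat worth knowing: for an ID with no filled row at all, idxmax returns the first label of that group, so only its first row would be kept.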
IIUC, you want to drop the trailing rows containing "-", taking "D-Values" as the reference column.
You can compute a cummax on the reversed boolean array:
mask = df['D-Values'].ne('-').iloc[::-1].cummax()
# or, for NaNs:
# mask = df['D-Values'].notna().iloc[::-1].cummax()
df2 = df[mask]
Output:
ID Start Date End Date Date_D D-Values Date_R R-Values
0 A 1/26/2015 5/26/2015 JAN_15 - 15-Jan -
1 A 1/26/2015 5/26/2015 FEB_15 - 15-Feb -
2 A 1/26/2015 5/26/2015 MAR_15 - 15-Mar -
3 A 1/26/2015 5/26/2015 APR_15 - 15-Apr -
4 A 1/26/2015 5/26/2015 MAY_15 -28 15-May 15000
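The mask above treats the whole frame as one block. Since the question mentions multiple IDs, a per-ID variant might look like the sketch below. It assumes "D-Values" is still the reference column and uses invented sample data; the reversed frame `rev` is just a helper name.

```python
import pandas as pd

# Hypothetical data with two IDs; '-' marks a missing value.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B"],
    "D-Values": ["-", "-28", "-", "-", "-", "-", "-5"],
})

# Walk each ID from its last row upwards.
rev = df.iloc[::-1]

mask = (
    rev["D-Values"].ne("-").astype(int)  # 1 where a real value is present
    .groupby(rev["ID"]).cummax()         # stays 1 once a value has been seen
    .astype(bool)
    .sort_index()                        # restore the original row order
)

# Rows after the last real value of each ID are dropped.
df2 = df[mask]
print(df2)
```

Sorting the mask back into index order before indexing avoids the "Boolean Series key will be reindexed" warning that a reversed mask would trigger.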