Search for the first occurrence of a value, column by column, using a condition?

Question

I have a dataframe that looks like this:

stud_id prod_id total_qty   ques_date   inv_qty inv_date    bkl_qty bkl_date    csum    accu_qty    accu_date   upto_inv_threshold  upto_bkl_threshold  upto_accu_threshold
0   101 12  100 13/11/2010  7.00000 16/02/2012  15  2013-01-16  15  10  13/08/2021  7.00000 22.00000    32.00000
1   101 12  100 13/11/2010  7.00000 16/02/2012  40  2011-10-22  55  10  13/08/2021  7.00000 62.00000    72.00000
2   101 12  100 13/11/2010  7.00000 16/02/2012  2   2019-09-10  57  10  13/08/2021  7.00000 64.00000    74.00000

df = pd.read_clipboard()

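I would like to perform the two steps listed below.

step-1) Search the dataframe for values >= 50 and return only the first occurrence.

Run this search in only 3 columns (upto_inv_threshold, upto_bkl_threshold, upto_accu_threshold), but go column by column. That is, finish searching one column before moving on to the next. For example, we first search all values of upto_inv_threshold, then all values of upto_bkl_threshold, and finally all values of upto_accu_threshold.

step-2) Get the date corresponding to the first occurrence found in step 1. If the value is found in upto_inv_threshold, take inv_date. If it is found in upto_bkl_threshold, take bkl_date. If it is found in upto_accu_threshold, take accu_date.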

I tried the following:

df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']
np.where(df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']>=50)

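But this did not help, and I could not get any further.

We have to do this for each stud_id and prod_id combination. The sample data contains only one group, but the real data will have multiple groups of stud_id and prod_id.

I expect my output to look like the one below. The date comes from the bkl_date column because the first value that meets the criterion (>= 50) is 62, which is in upto_bkl_threshold.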

stud_id, prod_id, fifty_pct_date
101,      12,       2011-10-22

Answer 1

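Select the required columns, then create a boolean mask that identifies the threshold cells with values >= 50. Use this mask to keep only the matching cells in the corresponding date columns. Then group the dataframe by stud_id and prod_id and aggregate with first, which gives the first non-null date per column in each group. Finally, bfill (backfill) along the column axis and take the first column to get the first date on which the threshold was reached.

cols = pd.Index(['inv', 'bkl', 'accu'])
mask = df['upto_' + cols + '_threshold'].ge(50)  # >= 50, as the question asks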

(
    df[cols + '_date']
    .where(mask.to_numpy())
    .groupby([df['stud_id'], df['prod_id']]).first()
    .bfill(axis=1).iloc[:, 0]
    .rename('fifty_pct_date')
    .reset_index()
)

Result

   stud_id  prod_id fifty_pct_date
0      101       12     2011-10-22
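To make each stage of the chain above easier to inspect, here is a minimal self-contained sketch that splits it into named steps. The second group (stud_id 102, prod_id 13) and its values are hypothetical, added only to show the per-group behaviour, and the frame is trimmed to the columns the logic needs.

```python
import pandas as pd

# Sample rows from the question plus one hypothetical group (102, 13).
df = pd.DataFrame({
    'stud_id': [101, 101, 101, 102, 102],
    'prod_id': [12, 12, 12, 13, 13],
    'inv_date': ['16/02/2012'] * 3 + ['01/03/2015'] * 2,
    'bkl_date': ['2013-01-16', '2011-10-22', '2019-09-10',
                 '2016-05-01', '2016-06-01'],
    'accu_date': ['13/08/2021'] * 3 + ['20/09/2020'] * 2,
    'upto_inv_threshold': [7.0, 7.0, 7.0, 10.0, 10.0],
    'upto_bkl_threshold': [22.0, 62.0, 64.0, 20.0, 30.0],
    'upto_accu_threshold': [32.0, 72.0, 74.0, 40.0, 55.0],
})

names = ['inv', 'bkl', 'accu']
date_cols = [f'{n}_date' for n in names]
thr_cols = [f'upto_{n}_threshold' for n in names]

# Step 1: True wherever a threshold cell is >= 50.
mask = df[thr_cols].ge(50)

# Step 2: keep a date only where its paired threshold cell qualifies.
# to_numpy() drops the labels, so the mask is applied by position.
masked_dates = df[date_cols].where(mask.to_numpy())

# Step 3: first non-null date per column within each group.
first_per_col = masked_dates.groupby([df['stud_id'], df['prod_id']]).first()

# Step 4: the leftmost non-null column wins (inv, then bkl, then accu).
result = (
    first_per_col.bfill(axis=1).iloc[:, 0]
    .rename('fifty_pct_date')
    .reset_index()
)
print(result)
```

With this data, group (101, 12) resolves to 2011-10-22 from bkl_date, and the hypothetical group (102, 13) resolves to 20/09/2020 from accu_date, because only its upto_accu_threshold column reaches 50.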

Answer 2

I think you can also get the target date with the following code:

Code:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'stud_id': {0: 101, 1: 101, 2: 101}, 'prod_id': {0: 12, 1: 12, 2: 12}, 'total_qty': {0: 100, 1: 100, 2: 100}, 'ques_date': {0: '13/11/2010', 1: '13/11/2010', 2: '13/11/2010'}, 'inv_qty': {0: 7.0, 1: 7.0, 2: 7.0}, 'inv_date': {0: '16/02/2012', 1: '16/02/2012', 2: '16/02/2012'}, 'bkl_qty': {0: 15, 1: 40, 2: 2}, 'bkl_date': {0: '2013-01-16', 1: '2011-10-22', 2: '2019-09-10'}, 'csum': {0: 15, 1: 55, 2: 57}, 'accu_qty': {0: 10, 1: 10, 2: 10}, 'accu_date': {0: '13/08/2021', 1: '13/08/2021', 2: '13/08/2021'}, 'upto_inv_threshold': {0: 7.0, 1: 7.0, 2: 7.0}, 'upto_bkl_threshold': {0: 22.0, 1: 62.0, 2: 64.0}, 'upto_accu_threshold': {0: 32.0, 1: 72.0, 2: 74.0}})

# Transform df
symbols = ['inv', 'bkl', 'accu']
df1 = df.melt(['stud_id', 'prod_id'], [f'{s}_date' for s in symbols], value_name='date')
df2 = df.melt(['stud_id', 'prod_id'], [f'upto_{s}_threshold' for s in symbols], value_name='threshold')

# Merge and get the target date(s)
df = df1.join(df2.loc[df2.threshold>=50, 'threshold'], how='inner')
df = df.groupby(['stud_id', 'prod_id'], as_index=False)['date'].first()

print(df)

Output:

stud_id	prod_id	date
101	12	2011-10-22
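For checking either answer on real data, a plain loop that states the rule directly can serve as a reference. This is only a sketch: the helper name first_date_at_threshold and the second group (102, 13), which never reaches 50, are hypothetical and not part of the question.

```python
import pandas as pd

def first_date_at_threshold(group, names=('inv', 'bkl', 'accu'), limit=50):
    """Scan the threshold columns in order; return the date paired with
    the first value >= limit, or None if no column reaches the limit."""
    for name in names:
        hits = group.loc[group[f'upto_{name}_threshold'] >= limit, f'{name}_date']
        if not hits.empty:
            return hits.iloc[0]
    return None

# Question's sample rows plus one hypothetical group with no value >= 50.
df = pd.DataFrame({
    'stud_id': [101, 101, 101, 102],
    'prod_id': [12, 12, 12, 13],
    'inv_date': ['16/02/2012'] * 3 + ['01/03/2015'],
    'bkl_date': ['2013-01-16', '2011-10-22', '2019-09-10', '2016-05-01'],
    'accu_date': ['13/08/2021'] * 3 + ['20/09/2020'],
    'upto_inv_threshold': [7.0, 7.0, 7.0, 5.0],
    'upto_bkl_threshold': [22.0, 62.0, 64.0, 15.0],
    'upto_accu_threshold': [32.0, 72.0, 74.0, 25.0],
})

# One output row per (stud_id, prod_id) group.
rows = [
    {'stud_id': s, 'prod_id': p, 'fifty_pct_date': first_date_at_threshold(g)}
    for (s, p), g in df.groupby(['stud_id', 'prod_id'])
]
result = pd.DataFrame(rows)
print(result)
```

The loop is slower than the vectorised versions, but it keeps groups that never reach the limit (with a missing date) and makes the column-by-column search order explicit.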


python

numpy

dataframe

pandas

pandas-groupby
