How to find and aggregate observations that replaced others in Python?

Context:

Sample dataset:

df <- data.frame(
  OppID = c("A123", "A765", "B456", "C137", "F879", "H987"),
  OppDate = c("1/20/2020", "1/21/2020", "7/21/2020", "1/4/2020", "11/1/2020", "8/21/2020"),
  OppStatus = c("Closed", "Closed", "Open", "Closed", "Open", "Open"),
  Notes = c("", "", "Refers to A123", "", "Refers to B456", "Refers to A765")
)

This is the resulting data frame (df):

  head(df)
  OppID OppDate   OppStatus Notes
1 A123  1/20/2020 Closed    
2 A765  1/21/2020 Closed   
3 B456  7/21/2020 Open      Refers to A123
4 C137  1/4/2020  Closed    
5 F879  11/1/2020 Open      Refers to B456
6 H987  8/21/2020 Open      Refers to A765

What I need to produce programmatically is this (a new data frame 'df2'):

  head(df2)
  OppID OppDate   OppStatus Notes               FirstOppDate
1 C137  1/4/2020  Closed    
2 F879  11/1/2020 Open      Refers to A123,B456 1/20/2020
3 H987  8/21/2020 Open      Refers to A765      1/21/2020

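As you can see in df2, OppIDs A123 and B456 should be rolled up into OppID F879 (because F879 is a continuation of B456, which in turn is a continuation of A123). A new column should record the OppDate of the oldest OppID in the chain (in this case A123, dated 1/20/2020).

The same applies to H987 (a continuation of A765). Finally, C137 is unchanged, because this OppID is not a continuation of any earlier OppID.

I have been trying to figure out a way to do this, but so far without success. I know how to extract the OppID from the free-text field, but I can't think of a way to check these relationships and aggregate them into the most recent OppID.

Any ideas? I hope what I am trying to achieve makes sense (I am not a native English speaker). Many thanks!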

import pandas as pd
columns = ['OppID', 'OppDate', 'OppStatus', 'Notes']
rows = [['A123', '1/20/2020', 'Closed', ''],
        ['A765', '1/21/2020', 'Closed', ''],
        ['B456', '7/21/2020', 'Open', 'Refers to A123'],
        ['C137', '1/4/2020', 'Closed', ''],
        ['F879', '11/1/2020', 'Open', 'Refers to B456'],
        ['H987', '8/21/2020', 'Open', 'Refers to A765']]

df = pd.DataFrame(rows, columns = columns)

# Take the last word of each note as the referenced OppID (None when the note is empty).
# A regular expression may suit your real data better.
df['ref_opp_id'] = [x.split()[-1] if len(x) > 0 else None for x in df['Notes']]

# For every row, follow its reference chain back to the oldest opportunity.
# This can be parallelized or further optimized.
total_ref_opps = []
first_opp_dates = []
for index, row in df.iterrows():
    total = []
    final_opp_id = row['ref_opp_id']
    first_opp_date = None
    # pd.notna treats both None and NaN as "no reference"
    while pd.notna(final_opp_id):
        total.append(final_opp_id)
        match = df[df['OppID'] == final_opp_id]
        first_opp_date = match['OppDate'].values[0]
        final_opp_id = match['ref_opp_id'].values[0]
    total_ref_opps.append(total)
    first_opp_dates.append(first_opp_date)

df['total_ref_opps'] = total_ref_opps
df['first_opp_dates'] = first_opp_dates

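# Flatten the chains: every OppID that was superseded by a later one
all_dup_items = [item for sublist in total_ref_opps for item in sublist]

# Keep only the rows that nothing else refers to (the latest in each chain)
df_new = df[~df['OppID'].isin(all_dup_items)].copy().reset_index(drop=True)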

df_new.head()
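
If you want the output to match df2 from the question more closely (merged Notes plus a FirstOppDate column), here is a rough sketch of a variant. It swaps the repeated DataFrame filtering for plain dict lookups and adds a guard against circular references. The regex pattern and the helper names extract_ref and ancestors are my own assumptions, not something from your data: I am guessing that every OppID is one uppercase letter followed by three digits, and that the last ID in a chain is the oldest one.

```python
import re
import pandas as pd

rows = [
    ["A123", "1/20/2020", "Closed", ""],
    ["A765", "1/21/2020", "Closed", ""],
    ["B456", "7/21/2020", "Open", "Refers to A123"],
    ["C137", "1/4/2020", "Closed", ""],
    ["F879", "11/1/2020", "Open", "Refers to B456"],
    ["H987", "8/21/2020", "Open", "Refers to A765"],
]
df = pd.DataFrame(rows, columns=["OppID", "OppDate", "OppStatus", "Notes"])

# Assumed ID format: one uppercase letter followed by three digits.
OPP_ID = re.compile(r"\b[A-Z]\d{3}\b")

def extract_ref(note):
    """Return the first OppID mentioned in a note, or None."""
    match = OPP_ID.search(note)
    return match.group(0) if match else None

# Plain dicts avoid re-filtering the frame at every step of a chain.
parent = {opp: extract_ref(note) for opp, note in zip(df["OppID"], df["Notes"])}
date_of = dict(zip(df["OppID"], df["OppDate"]))

def ancestors(opp_id):
    """Follow the references from opp_id back to the end of its chain."""
    chain, seen = [], {opp_id}
    current = parent.get(opp_id)
    # Stop on a missing reference, an unknown ID, or a cycle.
    while current is not None and current in parent and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent[current]
    return chain

# IDs that some later opportunity refers to are dropped from the result.
replaced = {p for p in parent.values() if p is not None}
df2 = df[~df["OppID"].isin(replaced)].reset_index(drop=True)

chains = [ancestors(opp) for opp in df2["OppID"]]
# Oldest ID first, to mirror "Refers to A123,B456" in the question.
df2["Notes"] = ["Refers to " + ",".join(reversed(c)) if c else "" for c in chains]
df2["FirstOppDate"] = [date_of[c[-1]] if c else "" for c in chains]
print(df2)
```

Note that the dates stay as strings here, so "oldest" means "last in the chain", not "smallest date". If a chain could be recorded out of order, converting OppDate with pd.to_datetime and taking the minimum over the chain would be safer.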

I am a new contributor. If this is what you were looking for, please mark this answer as accepted.