How to find and aggregate observations that replaced others in Python?
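Context:
- A sales partner registers an opportunity (OppID), which expires after 180 days if it is not closed;
- After it expires, the same sales partner registers a new OppID that is in fact a continuation of the earlier OppID;
- A manually entered free-text field (Notes) contains a reference to the earlier OppID, along with a string such as "Replaces" or "Refers to".
Example dataset: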
df <- data.frame(
  OppID = c("A123", "A765", "B456", "C137", "F879", "H987"),
  OppDate = c("1/20/2020", "1/21/2020", "7/21/2020", "1/4/2020", "11/1/2020", "8/21/2020"),
  OppStatus = c("Closed", "Closed", "Open", "Closed", "Open", "Open"),
  Notes = c("", "", "Refers to A123", "", "Refers to B456", "Refers to A765")
)
This is the resulting data frame (df):
head(df)
  OppID  OppDate    OppStatus  Notes
1 A123   1/20/2020  Closed
2 A765   1/21/2020  Closed
3 B456   7/21/2020  Open       Refers to A123
4 C137   1/4/2020   Closed
5 F879   11/1/2020  Open       Refers to B456
6 H987   8/21/2020  Open       Refers to A765
What I need to produce programmatically is the following (a new data frame, 'df2'):
head(df2)
  OppID  OppDate    OppStatus  Notes                FirstOppDate
1 C137   1/4/2020   Closed
2 F879   11/1/2020  Open       Refers to A123,B456  1/20/2020
3 H987   8/21/2020  Open       Refers to A765       1/21/2020
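As you can see in df2, OppIDs A123 and B456 should be rolled up into OppID F879 (because F879 is a continuation of B456, which in turn is a continuation of A123). A new column should record the OppDate of the oldest OppID in the chain (here A123, dated 1/20/2020).
The same applies to H987 (a continuation of A765). Finally, C137 is unchanged because it is not a continuation of any earlier OppID.
I have tried to find a way to do this, without success so far. I know how to extract the OppID from the free-text field, but I cannot think of a way to check these relationships and aggregate them into the latest OppID.
Any ideas? I hope what I am trying to achieve makes sense (I am not a native English speaker). Thanks a lot!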
import pandas as pd

columns = ['OppID', 'OppDate', 'OppStatus', 'Notes']
rows = [['A123', '1/20/2020', 'Closed', ''],
        ['A765', '1/21/2020', 'Closed', ''],
        ['B456', '7/21/2020', 'Open', 'Refers to A123'],
        ['C137', '1/4/2020', 'Closed', ''],
        ['F879', '11/1/2020', 'Open', 'Refers to B456'],
        ['H987', '8/21/2020', 'Open', 'Refers to A765']]
df = pd.DataFrame(rows, columns=columns)

# Take the last word of Notes as the referenced OppID.
# A regular expression may suit real data better.
df['ref_opp_id'] = [x.split()[-1] if len(x) > 0 else None for x in df['Notes']]

# For each row, follow the references back to the oldest OppID.
# This can be parallelized or further optimized.
total_ref_opps = []
first_opp_dates = []
for index, row in df.iterrows():
    total = []
    final_opp_id = row['ref_opp_id']
    first_opp_date = None
    while final_opp_id is not None:
        total.append(final_opp_id)
        # Note: .values[0] raises IndexError if the referenced OppID is not in df
        first_opp_date = df[df['OppID'] == final_opp_id]['OppDate'].values[0]
        final_opp_id = df[df['OppID'] == final_opp_id]['ref_opp_id'].values[0]
    total_ref_opps.append(total)
    first_opp_dates.append(first_opp_date)

df['total_ref_opps'] = total_ref_opps
df['first_opp_dates'] = first_opp_dates

# Drop every OppID that was replaced by a later one
all_dup_items = [item for sublist in total_ref_opps for item in sublist]
df_new = df[~df['OppID'].isin(all_dup_items)].copy().reset_index(drop=True)
df_new.head()
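If the repeated DataFrame filtering inside the loop gets slow, or if the notes are messier than the sample, the same idea can be sketched with a regular expression and plain dictionaries. This is only a sketch under assumptions that are not in the question: an OppID is taken to be one uppercase letter followed by three digits, each note refers to at most one earlier OppID, and the "oldest" opportunity is taken to be the start of the chain rather than the minimum date. The names `replaces`, `opp_date`, `chain_of` and `chains` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "OppID": ["A123", "A765", "B456", "C137", "F879", "H987"],
        "OppDate": ["1/20/2020", "1/21/2020", "7/21/2020",
                    "1/4/2020", "11/1/2020", "8/21/2020"],
        "OppStatus": ["Closed", "Closed", "Open", "Closed", "Open", "Open"],
        "Notes": ["", "", "Refers to A123", "", "Refers to B456", "Refers to A765"],
    }
)

# Assumed OppID pattern: one uppercase letter plus three digits.
# Rows without a match get NaN.
ref = df["Notes"].str.extract(r"\b([A-Z]\d{3})\b", expand=False)

# OppID -> the OppID it continues (only for rows that reference one)
replaces = {opp: r for opp, r in zip(df["OppID"], ref) if pd.notna(r)}
# OppID -> its own OppDate
opp_date = dict(zip(df["OppID"], df["OppDate"]))


def chain_of(opp_id):
    """Return the earlier OppIDs that opp_id continues, oldest first."""
    chain, seen = [], {opp_id}
    current = replaces.get(opp_id)
    # Stop at the start of the chain, at an unknown OppID, or on a cycle.
    while current is not None and current in opp_date and current not in seen:
        chain.append(current)
        seen.add(current)
        current = replaces.get(current)
    return chain[::-1]


chains = {opp: chain_of(opp) for opp in df["OppID"]}
replaced = {old for chain in chains.values() for old in chain}

# Keep only the latest OppID of each chain and rebuild Notes / FirstOppDate.
df2 = df[~df["OppID"].isin(replaced)].copy().reset_index(drop=True)
df2["Notes"] = [
    "Refers to " + ",".join(chains[opp]) if chains[opp] else ""
    for opp in df2["OppID"]
]
df2["FirstOppDate"] = [
    opp_date[chains[opp][0]] if chains[opp] else "" for opp in df2["OppID"]
]
print(df2)
```

The dictionary lookups avoid re-scanning the DataFrame at every step of a chain, and the `seen` set keeps the loop from running forever if two notes ever point at each other.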
I am a new contributor. If this is what you were looking for, please mark this answer as accepted.