Optimize building a new array column from other dataframes
I'm trying to merge the following 3 dataframes:
Leads
   Id                  Name          Title
0  00Q6F00000zEkMXUA0  V.B Swamy     Managing Director
1  00Q6F00000zEkMXUA1  Vandana Suri  Founder & CEO
2  00Q6F00000zEkMXUA2  Jane Smith    Advisor
Campaigns
   Id                  Name                NumberOfLeads
0  7016F000001Oo2xQAC  Testing Campaign A  1
1  7016F000001bHoHQAU  Testing Campaign B  2
Campaign Members
   CampaignId          LeadId
0  7016F000001Oo2xQAC  00Q6F00000zEkMXUA0
1  7016F000001bHoHQAU  00Q6F00000zEkMXUA0
2  7016F000001bHoHQAU  00Q6F00000zEkMXUA1
The final output I want is:
Leads with Campaigns
   Id                  Name          Title              Campaigns
0  00Q6F00000zEkMXUA0  V.B Swamy     Managing Director  ['Testing Campaign A', 'Testing Campaign B']
1  00Q6F00000zEkMXUA1  Vandana Suri  Founder & CEO      ['Testing Campaign B']
2  00Q6F00000zEkMXUA2  Jane Smith    Advisor            []
The output above is produced by taking the list of Campaign Members (i.e. Leads) for each Campaign and adding that information as a new Campaigns column in the Leads dataframe.
I was able to achieve this myself with the logic below, but I run into problems when the Leads and Campaign Members dataframes are very large. My machine runs out of memory about halfway through processing the data.
# List of all campaign names we process
campaign_keys = []
for index, row in campaigns.iterrows():
    cname = row['Name']
    cid = row['Id']
    # Get members of this campaign
    matching_members = campaign_members[campaign_members['CampaignId'] == cid]
    # Create dataframe of campaign member lead ids
    campaign_df = matching_members['LeadId'].to_frame()
    campaign_df = campaign_df.rename(columns={'LeadId': 'Id'})
    # Flag column named after the campaign: 1 means the lead is a member
    campaign_df[cname] = 1
    # Add to array
    campaign_keys.append(cname)
    # Merge the campaign members with the leads df
    leads_df = leads_df.merge(campaign_df, how='left', on='Id')
# Get only the columns for the campaigns we loaded
cdf = leads_df[campaign_keys]
# Build campaigns column
lists_entry = cdf.eq(1).apply(lambda x: list(x.index[x]), axis=1)
leads_df['Campaigns'] = lists_entry
# Drop all per-campaign flag columns (drop returns a new frame, so reassign)
leads_df = leads_df.drop(campaign_keys, axis=1)
How can I optimize this code to handle larger loads and more campaigns?
Any help would be appreciated. Thanks!
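Try:
1. merge the campaigns and members frames on "CampaignId"
2. groupby and create a list of campaigns for each "LeadId"
3. merge the leads dataframe with the newly created frame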
campaign_members = campaigns.merge(members, left_on="Id", right_on="CampaignId", how="right")
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)
output = leads.merge(campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left")
>>> output
   Id                  Name          Title              Campaigns
0  00Q6F00000zEkMXUA0  V.B Swamy     Managing Director  ['Testing Campaign A', 'Testing Campaign B']
1  00Q6F00000zEkMXUA1  Vandana Suri  Founder & CEO      ['Testing Campaign B']
2  00Q6F00000zEkMXUA2  Jane Smith    Advisor            NaN
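Note that a left merge leaves a missing value, not an empty list, for a lead that belongs to no campaign, whereas the question asks for []. The following is a minimal self-contained sketch of the three steps above, rebuilt from the sample data in the question, with one extra step that fills those gaps. The frame names `leads`, `campaigns` and `members` follow the answer's code. Selecting only the "Id" and "Name" columns before the first merge is my own assumption about how to keep the intermediate frame small; it is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
leads = pd.DataFrame({
    "Id": ["00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA1", "00Q6F00000zEkMXUA2"],
    "Name": ["V.B Swamy", "Vandana Suri", "Jane Smith"],
    "Title": ["Managing Director", "Founder & CEO", "Advisor"],
})
campaigns = pd.DataFrame({
    "Id": ["7016F000001Oo2xQAC", "7016F000001bHoHQAU"],
    "Name": ["Testing Campaign A", "Testing Campaign B"],
    "NumberOfLeads": [1, 2],
})
members = pd.DataFrame({
    "CampaignId": ["7016F000001Oo2xQAC", "7016F000001bHoHQAU", "7016F000001bHoHQAU"],
    "LeadId": ["00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA1"],
})

# Step 1: attach the campaign name to every membership row.
# Only "Id" and "Name" are carried along (assumption: other columns are not needed).
named_members = campaigns[["Id", "Name"]].merge(
    members, left_on="Id", right_on="CampaignId", how="right"
)

# Step 2: one list of campaign names per lead, indexed by LeadId
campaign_lists = named_members.groupby("LeadId")["Name"].agg(list)

# Step 3: left merge so that leads without any campaign are kept
output = leads.merge(
    campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left"
)

# Extra step: unmatched leads hold NaN after the merge; turn those into empty lists
output["Campaigns"] = output["Campaigns"].apply(
    lambda v: v if isinstance(v, list) else []
)

print(output)
```

Compared with the loop in the question, this never creates one column per campaign, so the width of the intermediate frames does not grow with the number of campaigns.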