Optimize building a new array column from other dataframes

I am trying to merge the following 3 dataframes:

Leads

                   Id            Name              Title
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor

Campaigns

                    Id                      Name    NumberOfLeads
0   7016F000001Oo2xQAC        Testing Campaign A                1
1   7016F000001bHoHQAU        Testing Campaign B                2

Campaign Members

             CampaignId              LeadId
0    7016F000001Oo2xQAC  00Q6F00000zEkMXUA0
1    7016F000001bHoHQAU  00Q6F00000zEkMXUA0
2    7016F000001bHoHQAU  00Q6F00000zEkMXUA1

The final output I want is:

Leads with Campaigns

                   Id            Name              Title                   Campaigns
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director      ['Testing Campaign A', 'Testing Campaign B']
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO      ['Testing Campaign B']
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor      []

The output above is built by getting the list of Campaign Members (i.e. Leads) for each Campaign, then adding that information as a new Campaigns column in the Leads dataframe.

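I was able to implement this myself with the logic below, but I run into problems when the Leads and Campaign Members dataframes are very large. My machine runs out of memory about halfway through processing the data.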

# List of all campaign names we process
campaign_keys = []

for index, row in campaigns.iterrows():
    cname = row['Name']
    cid = row['Id']

    # Get members of this campaign
    matching_members = campaign_members[campaign_members['CampaignId'] == cid]
    
    # Create dataframe of campaign member lead ids
    campaign_df = matching_members['LeadId'].to_frame()
    campaign_df = campaign_df.rename(columns={'LeadId': 'Id'})
    campaign_df[cname] = 1

    # Add to array
    campaign_keys.append(cname)

    # Merge the campaign members with the leads df
    leads_df = leads_df.merge(campaign_df, how='left', left_on='Id', right_on='Id')

    
# Get only the columns for the campaigns we loaded
cdf = leads_df[campaign_keys]

# Build campaigns column
lists_entry = cdf.eq(1).apply(lambda x: list(x.index[x]), axis=1)
leads_df['Campaigns'] = lists_entry

# Drop all List columns
# drop() returns a new frame, so the result has to be assigned back
leads_df = leads_df.drop(campaign_keys, axis=1)

How can I optimize this code to handle larger loads and more campaigns?

Any help would be greatly appreciated. Thanks!

Try:

  1. merge the campaigns and members frames on "CampaignId"
  2. groupby and build the list of campaigns for each "LeadId"
  3. merge the leads dataframe with the newly created frame

campaign_members = campaigns.merge(members, left_on="Id", right_on="CampaignId", how="right")
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)
output = leads.merge(campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left")

>>> output

                    Id          Name                    Title               Campaigns
0   00Q6F00000zEkMXUA0     V.B Swamy        Managing Director       ['Testing Campaign A', 'Testing Campaign B']
1   00Q6F00000zEkMXUA1  Vandana Suri            Founder & CEO       ['Testing Campaign B']
2   00Q6F00000zEkMXUA2    Jane Smith                  Advisor       NaN
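
Note that a lead with no campaign memberships gets a missing value from the left merge, not the empty list shown in the desired output. Below is a minimal, self-contained sketch of the three steps on the sample data from the question, with one extra step that swaps missing values for empty lists. The frame names `leads`, `campaigns` and `members` are assumed to match the question's three dataframes, and the `isinstance` check is only one possible way to do the fill.

```python
import pandas as pd

# Sample data copied from the question
leads = pd.DataFrame({
    "Id": ["00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA1", "00Q6F00000zEkMXUA2"],
    "Name": ["V.B Swamy", "Vandana Suri", "Jane Smith"],
    "Title": ["Managing Director", "Founder & CEO", "Advisor"],
})
campaigns = pd.DataFrame({
    "Id": ["7016F000001Oo2xQAC", "7016F000001bHoHQAU"],
    "Name": ["Testing Campaign A", "Testing Campaign B"],
    "NumberOfLeads": [1, 2],
})
members = pd.DataFrame({
    "CampaignId": ["7016F000001Oo2xQAC", "7016F000001bHoHQAU", "7016F000001bHoHQAU"],
    "LeadId": ["00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA0", "00Q6F00000zEkMXUA1"],
})

# 1. Attach the campaign name to every membership row
campaign_members = campaigns.merge(
    members, left_on="Id", right_on="CampaignId", how="right"
)

# 2. Collect the campaign names into one list per lead
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)

# 3. Join the lists back onto the leads; unmatched leads get NaN
output = leads.merge(
    campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left"
)

# Extra step (not part of the three steps above): replace NaN with an empty list
output["Campaigns"] = output["Campaigns"].apply(
    lambda value: value if isinstance(value, list) else []
)

print(output)
```

This runs one merge per dataframe instead of one merge per campaign, so it never creates a column for every campaign the way the loop in the question does.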