How to find rows that meet certain conditions by comparing other columns across rows that share the same value in a specific column

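I have a dataframe as shown below.

I want to get the rows that satisfy the conditions I am looking for, comparing values in several columns between/among rows that have the same value in a specific column.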

import pandas as pd

df = pd.DataFrame({'Customer_ID':[750, 1135, 1403, 7144,7144,7144,10424,10984,10984,12710,12710,
                              7151,7151,7152,13249,13249,13249,9303,9303,9461,9461,9478,9478,
                              9478,9710,9710],
               'Age':[25,36,63,56,56,56,25,45,45,38,38,16,16,73,50,50,50,41,41,63,63,22,22,22,
                      34,34],
              'Subscription_Product':['Product A','Product B','Product C','Product D',
                                      'Product C','Product A','Product C','Product B',
                                      'Product A','Product D','Product A','Product A',
                                      'Product C','Product D','Product B','Product A',
                                      'Product A','Product D','Product A','Product A',
                                      'Product C','Product A','Product B','Product A',
                                      'Product A','Product D'],
              'Entry_Date':['2011-09-25','2015-08-25','2015-10-25','1999-06-26',
                                         '1995-06-29','2002-08-25','2001-07-22','1995-01-29',
                                         '1997-05-10','2012-10-10','2015-06-10','1995-01-15',
                                         '2002-02-24','2019-04-25','1995-01-19','2001-02-25',
                                         '2014-03-15','2002-07-24','1997-03-19','2001-03-14',
                                         '2005-02-23','2001-02-02','2007-12-18','2010-12-18',
                                         '2013-01-09','2013-05-15'],
              'Cancellation_Date':['','','','','','','','','','','2002-08-30','','','',
                                   '2008-02-25','','','','','','','2011-12-18','','','','']})


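I am trying to find customers who were already subscribed to one or more products and later additionally subscribed to 'Product A'. At the time the customer added the 'Product A' subscription, the customer must have been actively subscribed to another product.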

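In other words, I want to get the rows, index numbers, or customer IDs from the dataframe where:

(1) The rows share the same [Customer_ID].

(2) 'Product A' must have been subscribed to additionally; that is, if the customer has/had 'Product A' along with other subscribed products, the subscription start date ([Entry_Date]) of 'Product A' must be later than the subscription start date of the other product he has/had.

(3) When the customer has cancelled a subscription, it still counts if the cancellation date of the customer's old subscription is later than the start date of his 'Product A' subscription (because this means he subscribed to Product A while actively subscribed to another product).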

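For example, the customer with Customer ID '10984' additionally subscribed to Product A while he was already subscribed to another product. This customer (row or customer ID) is one of those I am looking for.

As another example, the customer with ID '9478' subscribed to 'Product A' before cancelling his subscription to 'Product B', which means that when he subscribed to 'Product A' he was actively subscribed to another product. So he is also one of the customers I am looking for.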

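I tried using .loc, but it was not enough for what I need. Maybe a loop would work?

I would appreciate it if anyone could help me solve this. Thank you in advance.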

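Consider using the function provided below.

The function ignores all customers who have not made more than one subscription, and it also ignores those who have not subscribed to "Product A" at least once. This improves the efficiency of the program in case you want to use it on a dataframe with many rows, although for the example provided it won't make much difference.

After that, it loops through the resulting dataframe to see which customers meet the conditions you are looking for.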
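
The pre-filtering step described above can be tried on its own before reading the full function. This is only a minimal sketch: the tiny `sample` dataframe and the variable names are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Tiny hypothetical sample, not the data from the question
sample = pd.DataFrame({
    'Customer_ID': [1, 2, 2, 3, 3],
    'Subscription_Product': ['Product A', 'Product B', 'Product A',
                             'Product B', 'Product C'],
})

# keep=False marks every row of a repeated Customer_ID, not only the later ones
repeated = sample[sample.duplicated(subset=['Customer_ID'], keep=False)]

# IDs of customers that have at least one 'Product A' row
has_a = sample.loc[sample['Subscription_Product'].eq('Product A'), 'Customer_ID'].unique()

# Repeated customers who also hold 'Product A'
candidates = repeated[repeated['Customer_ID'].isin(has_a)]
print(candidates['Customer_ID'].unique().tolist())
```

Only customer 2 survives both filters here: customer 1 has a single row, and customer 3 never subscribed to 'Product A'.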

def findSpecialCustomer(df):
    # Keep only customers that appear in more than one row
    aux = df[df.duplicated(subset=['Customer_ID'], keep=False)].reset_index(drop=True)
    # IDs of customers who acquired "Product A"
    listOfRepeated = pd.unique(df[df.Subscription_Product.eq('Product A')]['Customer_ID'])
    # Discard from the previous dataframe the customers who didn't acquire "Product A"
    aux = aux[aux.Customer_ID.isin(listOfRepeated)]

    # List that will contain the customers we are looking for
    customersFound = []

    # Iterate over each customer in the filtered dataframe
    for i in pd.unique(aux['Customer_ID']):
        # Auxiliary boolean: does the customer have a subscription other than "Product A"?
        prevSubscription = False
        # Auxiliary string holding the cancellation date of another product, if any
        cancellation_date = ""
        # Iterate over each product the same customer has acquired (in row order)
        for index, row in aux[aux.Customer_ID.eq(i)].iterrows():
            # A product other than "Product A": set the boolean to True
            if row['Subscription_Product'] != 'Product A':
                prevSubscription = True
                # If Cancellation_Date exists, save it in the auxiliary variable
                if row['Cancellation_Date'] != "":
                    cancellation_date = row['Cancellation_Date']
            # The customer already has another product and is now buying "Product A"
            if row['Subscription_Product'] == 'Product A' and prevSubscription:
                if cancellation_date != "":
                    # Add the customer only if the other subscription hadn't been cancelled
                    # by the time "Product A" was acquired
                    if cancellation_date > row['Entry_Date']:
                        customersFound.append(row['Customer_ID'])
                else:
                    # No cancellation date: the other subscription is still active
                    customersFound.append(row['Customer_ID'])

    return customersFound

# Our list of the ID's of the customers with the conditions we are looking for
listOfUsersFound = findSpecialCustomer(df)
listOfUsersFound
Out[2]: [7144, 10984, 12710, 13249, 9303, 9478]
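
Note that the loop treats whichever row comes first for a customer as the earlier subscription, so its result depends on row order: customer 9303 is returned even though his 'Product A' entry date (1997-03-19) precedes his 'Product D' entry date (2002-07-24). As a hedged alternative, not part of the function above, the dates can be compared explicitly with a merge. The helper name `find_overlapping_a` and the reduced `sample` (four customers taken from the question's data) are assumptions made for illustration.

```python
import pandas as pd

# Subset of the question's data: customers 10984, 13249, 9303 and 9478
sample = pd.DataFrame({
    'Customer_ID': [10984, 10984, 13249, 13249, 13249, 9303, 9303, 9478, 9478, 9478],
    'Subscription_Product': ['Product B', 'Product A', 'Product B', 'Product A', 'Product A',
                             'Product D', 'Product A', 'Product A', 'Product B', 'Product A'],
    'Entry_Date': ['1995-01-29', '1997-05-10', '1995-01-19', '2001-02-25', '2014-03-15',
                   '2002-07-24', '1997-03-19', '2001-02-02', '2007-12-18', '2010-12-18'],
    'Cancellation_Date': ['', '', '2008-02-25', '', '', '', '', '2011-12-18', '', ''],
})

def find_overlapping_a(df):
    """Hypothetical helper: IDs whose 'Product A' started while another product was active."""
    d = df.copy()
    d['Entry_Date'] = pd.to_datetime(d['Entry_Date'])
    # Empty strings become NaT, which is read here as "not cancelled"
    d['Cancellation_Date'] = pd.to_datetime(d['Cancellation_Date'], errors='coerce')

    is_a = d['Subscription_Product'].eq('Product A')
    # Pair every 'Product A' row with every other-product row of the same customer
    pairs = d[is_a].merge(d[~is_a], on='Customer_ID', suffixes=('_a', '_other'))

    # Condition (2): the other product started before 'Product A'
    started_before = pairs['Entry_Date_other'] < pairs['Entry_Date_a']
    # Condition (3): the other product was not cancelled before 'Product A' started
    still_active = (pairs['Cancellation_Date_other'].isna()
                    | (pairs['Cancellation_Date_other'] > pairs['Entry_Date_a']))
    return pairs.loc[started_before & still_active, 'Customer_ID'].unique().tolist()

print(find_overlapping_a(sample))
```

For this sample the helper returns customers 10984, 13249 and 9478, and leaves out 9303 because his other product started after 'Product A'. Whether that stricter reading of condition (2) is what you want is a judgment call.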

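[Edit+] If you also want another function that returns a list of the customers who bought "Product A" while they were already subscribed to another "Product A", you can modify the function from my previous answer as I said in the comments. But I would also recommend a slightly different approach to handle multiple cancellations.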

def findMultipleASubscriptionCustomers(df):
    # Keep only customers who have more than one "Product A" subscription
    aux = df[df.Subscription_Product == 'Product A'].groupby('Customer_ID').filter(lambda x: len(x) > 1)

    # List that will contain the customers we are looking for
    customersFound = []

    # Iterate over each customer in the filtered dataframe
    for i in pd.unique(aux['Customer_ID']):
        # Auxiliary boolean: did an earlier "Product A" row have no cancellation date?
        subscriptionsWithoutCancelDate = False
        # Auxiliary list to save all cancellation dates seen so far
        listOfCancellDates = []
        # Iterate over each "Product A" row of the same customer
        for index, row in aux[aux.Customer_ID.eq(i)].iterrows():
            # An earlier "Product A" had no cancellation date, so this is a second one
            if subscriptionsWithoutCancelDate:
                customersFound.append(row['Customer_ID'])
                # Break out of the loop to continue with the next customer
                break
            # If we've previously collected cancellation dates (the list is NOT empty)
            if listOfCancellDates:
                # If ALL of those dates are after the entry date of the current subscription,
                # the customer purchased this "Product A" while having another one
                if all(row['Entry_Date'] < dates for dates in listOfCancellDates):
                    customersFound.append(row['Customer_ID'])
                    break

            # If Cancellation_Date exists, save it in the list
            if row['Cancellation_Date'] != "":
                listOfCancellDates.append(row['Cancellation_Date'])
            else:
                # No cancellation date: if the same customer has another "Product A" row
                # (one more iteration of this loop), it is their SECOND active "Product A"
                subscriptionsWithoutCancelDate = True

    return customersFound

auxList = findMultipleASubscriptionCustomers(df)
auxList
Out[2]: [13249, 9478]
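
Two details in this loop are easy to miss. The dates are compared as plain strings, which is safe only because they are zero-padded `YYYY-MM-DD` values. And `all()` over an empty list is `True`, so the list of cancellation dates has to be non-empty before it is tested. A small sketch of both points, with made-up values and variable names:

```python
# Zero-padded ISO dates sort the same way as strings and as real dates
iso_ok = '2010-12-18' < '2011-12-18'
print(iso_ok)            # True

# A non-padded format does not: as strings, '9' sorts after '1'
unpadded_ok = '9/1/2010' < '10/1/2010'
print(unpadded_ok)       # False, although September precedes October

cancel_dates = []        # no cancellation dates collected yet
entry = '2010-12-18'

# all() over an empty iterable is vacuously True
vacuous = all(entry < d for d in cancel_dates)
print(vacuous)           # True

# Requiring a non-empty list first avoids the vacuous match
guarded = bool(cancel_dates) and all(entry < d for d in cancel_dates)
print(guarded)           # False
```

If the dates might arrive in another format, converting the columns with `pd.to_datetime` before comparing is the safer route.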