Cohort group condition

Question

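I'm trying to build a cohort analysis that shows unique purchases over time, with one special condition: the cohorts should consist only of users who used a discount coupon on their first order.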

My data set looks like this:

╔════╦═════════════════╦══════════════╦═══════════╗
║ id ║ submitted_by_id ║ submitted_at ║ coupon_id ║
╠════╬═════════════════╬══════════════╬═══════════╣
║  1 ║               1 ║ 2015-01-01   ║           ║
║  2 ║               2 ║ 2015-01-02   ║         1 ║
║  3 ║               1 ║ 2015-02-02   ║         1 ║
║  4 ║               3 ║ 2015-02-02   ║           ║
║... ║             ... ║        ...   ║       ... ║
╚════╩═════════════════╩══════════════╩═══════════╝

I can create a cohort analysis for the whole data set like this:

import numpy as np
import pandas as pd

# data_set is the collection of order records shown above
data_set = list(data_set)
df = pd.DataFrame(data_set)
# Month in which each order was placed
df['OrderPeriod'] = df.submitted_at.apply(lambda x: x.strftime('%Y-%m'))

# Cohort = month of each user's first order
df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.strftime('%Y, %m'))
df.reset_index(inplace=True)

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# Unique users and unique orders per cohort and order month
cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})

cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

# cohort_period is the helper from the post credited below (not defined here);
# it numbers each cohort's periods 1, 2, ...
cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

# Cohort size = number of users in the cohort's first period
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()

# Cumulative orders per user, one column per cohort
total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)

This displays my cohorts like this:

CohortGroup     2015, 01    2015, 02
CohortPeriod                                                               
1               1           1
2               1.5

What I want is to somehow restrict my cohort groups to customers whose first order has a coupon_id.

So my result table would look like this:

CohortGroup     2015, 01    2015, 02
CohortPeriod                                                               
1               1           NaN
2               1

How can I do this?

Credit to http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/
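The question's code calls cohort_period without defining it. In the credited post it appears to be a small helper that numbers the rows of each cohort 1, 2, 3, ... A minimal sketch under that assumption follows; the cohorts_demo frame is a made-up stand-in for the aggregated cohorts table, and the cumcount line is an equivalent that does not go through groupby.apply, whose index handling has changed between pandas versions.

```python
import numpy as np
import pandas as pd


def cohort_period(df):
    """Add a CohortPeriod column numbering the rows 1..N in their existing order.

    Assumed to match the helper from the credited post; meant to be applied
    to the rows of a single cohort.
    """
    df = df.copy()
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df


# Hypothetical stand-in for the aggregated `cohorts` frame
idx = pd.MultiIndex.from_tuples(
    [('2015, 01', '2015-01'), ('2015, 01', '2015-02'), ('2015, 02', '2015-02')],
    names=['CohortGroup', 'OrderPeriod'],
)
cohorts_demo = pd.DataFrame(
    {'TotalUsers': [2, 1, 1], 'TotalOrdersInPeriod': [2, 1, 1]},
    index=idx,
)

# Helper applied to the rows of the first cohort only
first_cohort = cohort_period(cohorts_demo.iloc[:2])

# Equivalent numbering for all cohorts at once, without groupby.apply
cohorts_demo['CohortPeriod'] = cohorts_demo.groupby(level=0).cumcount() + 1
```

Both variants number periods by row position within a cohort, so a month with no orders is skipped rather than left as a gap.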

Answer 1

Starting with:

   id  submitted_by_id submitted_at  coupon_id
0   1                1   2015-01-01        NaN
1   2                2   2015-01-02          1
2   3                1   2015-02-02          1
3   4                3   2015-02-02        NaN

You can get the cohort groups and periods as follows:

df['order_period'] = pd.to_datetime(df.submitted_at).dt.to_period('M')
df = df.rename(columns={'submitted_by_id': 'customer_id'}).drop(['id', 'submitted_at'], axis=1)
df['cohort_group'] = df.sort_values('order_period').groupby('customer_id')['order_period'].transform(lambda x: x.head(1))
df['cohort_period'] = df.groupby(['cohort_group', 'customer_id'])['order_period'].rank()

   customer_id  coupon_id order_period cohort_group  cohort_period
0            1        NaN      2015-01      2015-01              1
1            2          1      2015-01      2015-01              1
2            1          1      2015-02      2015-01              2
3            3        NaN      2015-02      2015-02              1
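As an alternative sketch of the same step, the cohort can be taken as the month of each customer's earliest order, and the period as the number of calendar months since that cohort month plus one. This is an assumption about what the period should mean: unlike rank(), it does not produce fractional values when a customer orders twice in the same month. The orders frame below is a hypothetical copy of the four sample rows.

```python
import pandas as pd

# Hypothetical copy of the four sample orders
orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']
    ),
})

# Month in which each order was placed
orders['order_period'] = orders['submitted_at'].dt.to_period('M')

# Cohort = month of the customer's earliest order
orders['cohort_group'] = (
    orders.groupby('customer_id')['submitted_at'].transform('min').dt.to_period('M')
)

# Period = whole months elapsed since the cohort month, counted from 1
op, cg = orders['order_period'], orders['cohort_group']
orders['cohort_period'] = (
    (op.dt.year - cg.dt.year) * 12 + (op.dt.month - cg.dt.month) + 1
)
```

On the sample rows this yields the same cohort_group and cohort_period values as the table above.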

Now you can pick out the customers who used a coupon in their first cohort_period (only one in the sample data):

coupon_customers = df.groupby(['cohort_group', 'customer_id']).apply(lambda x: x.sort_values('cohort_period').iloc[0]).dropna(subset=['coupon_id']).customer_id.tolist()

[2]
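The same list can probably be obtained without groupby.apply by keeping each customer's earliest order and checking whether it has a coupon. A minimal sketch, assuming the original submitted_at column is still available (the answer drops it earlier) and using a hypothetical orders frame built from the sample rows:

```python
import pandas as pd

# Hypothetical copy of the four sample orders, with submitted_at kept
orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']
    ),
    'coupon_id': [None, 1, 1, None],
})

# Earliest order per customer: sort by time, keep the first row for each customer
first_orders = (
    orders.sort_values('submitted_at')
          .drop_duplicates('customer_id', keep='first')
)

# Customers whose earliest order carries a coupon
coupon_customers = first_orders.loc[
    first_orders['coupon_id'].notna(), 'customer_id'
].tolist()
```

Customer 1 does use a coupon, but only on the second order, so only customer 2 ends up in the list.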

Based on a Series of customer_id values as they appear in each cohort_group and cohort_period:

df = df.set_index(['cohort_group', 'cohort_period']).loc[:, 'customer_id'].to_frame()

                            customer_id
cohort_group cohort_period             
2015-01      1                        1
             1                        2
             2                        1
2015-02      1                        3

you get the cohort count for all customers:

cohort_count = df.groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')

cohort_period           1   2
cohort_group                 
2015-01                 2   1
2015-02                 1 NaN

Or, excluding the coupon_customers:

cohort_count_no_coupons = df[~df.isin(coupon_customers)].groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')

cohort_period           1   2
cohort_group                 
2015-01                 1   1
2015-02                 1 NaN
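The question asks for the opposite selection, i.e. only customers whose first order used a coupon. A sketch of that variant, assuming the indexed frame and the coupon_customers list from the steps above (rebuilt here by hand so the snippet runs on its own); the name cohort_count_coupons is made up for illustration:

```python
import pandas as pd

# Hand-built copy of the indexed frame from the previous step
idx = pd.MultiIndex.from_tuples(
    [('2015-01', 1), ('2015-01', 1), ('2015-01', 2), ('2015-02', 1)],
    names=['cohort_group', 'cohort_period'],
)
df = pd.DataFrame({'customer_id': [1, 2, 1, 3]}, index=idx)
coupon_customers = [2]

# Keep only rows belonging to coupon customers, then count per cohort and period
coupon_only = df[df['customer_id'].isin(coupon_customers)]
cohort_count_coupons = (
    coupon_only.groupby(level=['cohort_group', 'cohort_period'])['customer_id']
               .count()
               .unstack('cohort_period')
)
```

Because the rows are dropped before grouping, cohorts and periods without any coupon customer simply do not appear in the result.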

Answer 2

Thanks to Stefan for pointing me in the right direction; this is what I ended up doing. I'll mark Stefan's answer as accepted, since it is what led me to the solution.

I extended the test data set a little; it now looks like this:

coupon_id final_amount  id        submitted_at  submitted_by_id OrderPeriod
0        NaN          100   1 2015-01-01 14:30:00                1     2015-01
1          1          100   2 2015-01-02 14:31:00                2     2015-01
2          1          100   3 2015-02-02 14:31:00                1     2015-02
3        NaN          100   4 2015-02-02 14:31:00                3     2015-02
4        NaN          100   5 2015-02-02 14:31:00                2     2015-02
5          2          100   6 2015-01-02 14:31:00                4     2015-01
6          2          100   7 2015-02-03 14:31:00                5     2015-02
7        NaN          100   8 2015-01-03 14:31:00                2     2015-01

Here it is as a list of Python dicts:

import datetime
from decimal import Decimal

sample_data = [
        {'id': 1,
         'submitted_by_id': 1,
         'submitted_at': datetime.datetime(2015, 1, 1, 14, 30),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 2,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 1,
         },
        {'id': 3,
         'submitted_by_id': 1,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 1,
         },
        {'id': 4,
         'submitted_by_id': 3,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 5,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 6,
         'submitted_by_id': 4,
         'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 2,
         },
        {'id': 7,
         'submitted_by_id': 5,
         'submitted_at': datetime.datetime(2015, 2, 3, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 2,
         },
        {'id': 8,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 1, 3, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
    ]

The solution is as follows:

df = pd.DataFrame(sample_data)
df['OrderPeriod'] = df.submitted_at.dt.to_period('M')

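# group is set by the caller: 'used_coupon', 'did_not_use_coupon',
# or any other value to keep all customers
if group in ['used_coupon', 'did_not_use_coupon']: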
    df2 = df.copy()

    df2['CohortGroup'] = df2.sort_values('OrderPeriod').\
        groupby('submitted_by_id')['OrderPeriod'].transform(lambda x: x.head(1))
    df2['CohortPeriod'] = df2.groupby(
        ['OrderPeriod', 'submitted_by_id']
    )['OrderPeriod'].rank()

    coupon_customers = df2.groupby(['CohortGroup', 'submitted_by_id']).apply(
            lambda x: x.sort_values('submitted_at').iloc[0]
    ).dropna(subset=['coupon_id']).submitted_by_id.tolist()

    # coupon_customers = [2, 4, 5]

    if group == 'used_coupon':
        # delete rows in the original dataframe where the customer is not
        # in the coupon_customers_list
        df = df[df['submitted_by_id'].isin(coupon_customers)]
    # group == 'did_not_use_coupon'
    else:
        # delete rows in the original dataframe where the customer is
        # in the coupon_customers_list
        df = df[~df['submitted_by_id'].isin(coupon_customers)]

# From here it's just the same code as I originally used
df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.to_period('M'))

df.reset_index(inplace=True)
print(df.head())

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})

cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

cohorts = cohorts.groupby(level=0).apply(cohort_period)

cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()

cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()

total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)

Result for group = 'used_coupon':

CohortPeriod    1       2
CohortGroup     
2015-01         1.50    2.00
2015-02         1.00
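For readers who want to reproduce these numbers, here is a self-contained sketch of the same pipeline. It is not the code above verbatim: the function name total_buys_for is made up, the periods are numbered with cumcount instead of the cohort_period helper, and the first-order check uses drop_duplicates. The rows mirror the extended sample data, without the final_amount column.

```python
import datetime
import pandas as pd

# (id, submitted_by_id, submitted_at, coupon_id) rows from the extended sample data
rows = [
    (1, 1, datetime.datetime(2015, 1, 1, 14, 30), None),
    (2, 2, datetime.datetime(2015, 1, 2, 14, 31), 1),
    (3, 1, datetime.datetime(2015, 2, 2, 14, 31), 1),
    (4, 3, datetime.datetime(2015, 2, 2, 14, 31), None),
    (5, 2, datetime.datetime(2015, 2, 2, 14, 31), None),
    (6, 4, datetime.datetime(2015, 1, 2, 14, 31), 2),
    (7, 5, datetime.datetime(2015, 2, 3, 14, 31), 2),
    (8, 2, datetime.datetime(2015, 1, 3, 14, 31), None),
]
orders = pd.DataFrame(
    rows, columns=['id', 'submitted_by_id', 'submitted_at', 'coupon_id']
)


def total_buys_for(df, group=None):
    """Cumulative orders per user by cohort, optionally filtered by first-order coupon use."""
    # Customers whose earliest order carries a coupon
    first = df.sort_values('submitted_at').drop_duplicates('submitted_by_id')
    coupon_customers = first.loc[first['coupon_id'].notna(), 'submitted_by_id']
    in_coupon = df['submitted_by_id'].isin(coupon_customers)

    if group == 'used_coupon':
        df = df[in_coupon]
    elif group == 'did_not_use_coupon':
        df = df[~in_coupon]
    df = df.copy()

    # Order month and cohort month (month of the user's first order)
    df['OrderPeriod'] = df['submitted_at'].dt.to_period('M')
    df['CohortGroup'] = (
        df.groupby('submitted_by_id')['submitted_at'].transform('min').dt.to_period('M')
    )

    cohorts = df.groupby(['CohortGroup', 'OrderPeriod']).agg(
        TotalUsers=('submitted_by_id', 'nunique'),
        TotalOrdersInPeriod=('id', 'nunique'),
    )
    # Number each cohort's periods 1, 2, ... in chronological order
    cohorts['CohortPeriod'] = cohorts.groupby(level=0).cumcount() + 1
    cohorts = cohorts.reset_index().set_index(['CohortGroup', 'CohortPeriod'])

    cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
    cohorts['TotalOrders'] = cohorts.groupby(level=0)['TotalOrdersInPeriod'].cumsum()
    # Rows are cohort periods, columns are cohort groups
    return cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)


used = total_buys_for(orders, 'used_coupon')
not_used = total_buys_for(orders, 'did_not_use_coupon')
```

Here used has CohortPeriod as rows and CohortGroup as columns, so it is the transpose of the table shown above.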

python

statistics

pandas