Compare column values of two dataframes. Find which values are in one df and not the other

I have the following dataset:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year

I want to compare the customers in 2017 and 2018 to see whether the store has lost customers.

I made two subsets corresponding to 2017 and 2018:

Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]

Then I tried this to compare the two:

Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn

I get the following output:

True     2206
False     324
Name: Customer ID, dtype: int64

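The problem is that some customers appear several times in the dataset because they placed several orders. I would like to get only the unique customers (Customer ID is the only unique attribute) and then compare the two dataframes to see how many customers the store lost between 2017 and 2018.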

You can just use plain Python sets to get the unique customer IDs for each year and then subtract one from the other:

set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)

Out: 83
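
As a minimal sketch of the same idea on made-up data (the `toy` frame and its IDs are hypothetical, not taken from the sales dataset), building a set collapses repeated orders, so each customer is counted once no matter how often they ordered:

```python
import pandas as pd

# Hypothetical toy data: customer "A" places two orders in 2017.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C", "A", "C", "D"],
    "OrderYear":   [2017, 2017, 2017, 2017, 2018, 2018, 2018],
})

# A set keeps each ID once, regardless of how many orders it has.
ids_2017 = set(toy.loc[toy["OrderYear"] == 2017, "Customer ID"])
ids_2018 = set(toy.loc[toy["OrderYear"] == 2018, "Customer ID"])

# Customers seen in 2017 but not in 2018.
lost = ids_2017 - ids_2018
print(sorted(lost))
```

Here only "B" ordered in 2017 and not in 2018, so `lost` contains that single ID.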

For your original approach to work, you need to drop the duplicates from the DataFrame so that each customer appears only once:

Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()

Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()

#Out: 
True     552
False     83
Name: Customer ID, dtype: int64
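
If you want to see which customers were lost rather than only count them, the boolean mask can be inverted with `~` and used to filter the 2017 Series. A small sketch on made-up data (the `toy` frame and the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a repeated customer in 2017.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C", "A", "C", "D"],
    "OrderYear":   [2017, 2017, 2017, 2017, 2018, 2018, 2018],
})

# One row per customer and year, as in the answer above.
cust_2017 = toy.loc[toy["OrderYear"] == 2017, "Customer ID"].drop_duplicates()
cust_2018 = toy.loc[toy["OrderYear"] == 2018, "Customer ID"].drop_duplicates()

# True where a 2017 customer also ordered in 2018.
stayed = cust_2017.isin(cust_2018)

# Invert the mask to keep only the customers who did not come back.
lost_ids = cust_2017[~stayed].tolist()
print(lost_ids)
```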

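If only a single comparison is needed, I would use Python sets (here Customer_2017 and Customer_2018 are the DataFrame subsets from the question):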

c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')

Output:

lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138

Based on @Corralien's crosstab suggestion:

out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int).diff(axis=1)
    .replace({0: 'remained', 1: 'new', -1: 'lost'})
    .apply(pd.Series.value_counts)
)

Output:

OrderYear  2015  2016  2017  2018
lost        NaN   163   123    83
new         NaN   141   191   138
remained    NaN   489   479   572
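
One thing to keep in mind when reading this table: a difference of 0 covers both customers who ordered in both years and customers who ordered in neither, which may be why the 2018 "remained" figure (572) is higher than the 552 found with the set comparison. A rough sketch that separates the two cases for a single pair of years, on made-up data (the `toy` frame and the "absent in both" label are my own assumptions):

```python
import pandas as pd

# Hypothetical toy data: "B" only orders in 2016, so it is absent
# in both 2017 and 2018.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B", "C", "C", "D", "E"],
    "OrderYear":   [2016, 2017, 2018, 2016, 2017, 2018, 2018, 2017],
})

# True where the customer has at least one order in that year.
present = pd.crosstab(toy["Customer ID"], toy["OrderYear"]).gt(0)
prev, curr = present[2017], present[2018]

summary = pd.Series({
    "lost": int((prev & ~curr).sum()),            # ordered in 2017 only
    "new": int((~prev & curr).sum()),             # ordered in 2018 only
    "remained": int((prev & curr).sum()),         # ordered in both years
    "absent in both": int((~prev & ~curr).sum()), # ordered in neither year
})
print(summary)
```

With this toy data, "B" lands in "absent in both" instead of being counted as remained.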

For further analysis, you can use pd.crosstab:

out = pd.crosstab(df['Customer ID'], df['OrderYear'])

At this point your dataframe looks like this:

>>> out
OrderYear    2015  2016  2017  2018
Customer ID                        
AA-10315        4     1     4     2
AA-10375        2     4     4     5
AA-10480        1     0    10     1
AA-10645        6     3     8     1
AB-10015        4     0     2     0  # <- lost customer
...           ...   ...   ...   ...
XP-21865       10     3     9     6
YC-21895        3     1     3     1
YS-21880        0     5     0     7
ZC-21910        5     9     9     8
ZD-21925        3     0     5     1

The values are the number of orders for each customer and year.

Now it is easy to get the "lost customers":

>>> sum((out[2017] != 0) & (out[2018] == 0))
83
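
The same mask can also be used to pull the IDs of the lost customers out of the crosstab index. A minimal sketch on made-up data (the `toy` frame is hypothetical and much smaller than the real dataset):

```python
import pandas as pd

# Hypothetical toy data; "B" orders in 2017 but not in 2018.
toy = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C", "A", "C", "D"],
    "OrderYear":   [2017, 2017, 2017, 2017, 2018, 2018, 2018],
})

# Rows are customers, columns are years, values are order counts.
out = pd.crosstab(toy["Customer ID"], toy["OrderYear"])

# At least one order in 2017 and none in 2018.
lost_mask = (out[2017] != 0) & (out[2018] == 0)

lost_count = int(lost_mask.sum())
lost_ids = out.index[lost_mask].tolist()
print(lost_count, lost_ids)
```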