Compare two dataframes' column values: find which values are in one df and not the other
I have the following dataset:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year
I want to compare the customer base in 2017 and 2018 to see whether the store has lost customers.
I made two subsets, one for 2017 and one for 2018:
Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]
Then I tried this to compare the two:
Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn
I get the following output:
True 2206
False 324
Name: Customer ID, dtype: int64
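The problem is that some customers appear multiple times in the dataset because they placed several orders, so these counts are per order rather than per customer.
I only want the unique customers (Customer ID is the only unique attribute) and then to compare the two dataframes to see how many customers the store lost between 2017 and 2018.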
You can use plain Python sets to get the unique customer IDs for each year, then take the set difference:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work, you need to drop the duplicates so that each customer appears only once per year:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64
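As an alternative that none of the answers here use, an outer merge with `indicator=True` labels each unique customer as present in one year, the other, or both. This is a minimal sketch on made-up toy data (the `orders` frame and the `labels` mapping are assumptions for illustration, not part of the question's dataset):

```python
import pandas as pd

# Toy order data (hypothetical, not the sales.csv from the question).
orders = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C", "A", "C", "C", "D"],
    "OrderYear":   [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
})

# One row per unique customer for each year.
cust_2017 = orders.loc[orders["OrderYear"] == 2017, ["Customer ID"]].drop_duplicates()
cust_2018 = orders.loc[orders["OrderYear"] == 2018, ["Customer ID"]].drop_duplicates()

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
merged = cust_2017.merge(cust_2018, on="Customer ID", how="outer", indicator=True)

# Illustrative names for the three cases (left = 2017, right = 2018).
labels = {"left_only": "lost", "both": "remained", "right_only": "new"}
status = merged["_merge"].astype(str).map(labels)
counts = status.value_counts().to_dict()

# IDs of customers seen in 2017 but not in 2018.
lost_ids = merged.loc[merged["_merge"] == "left_only", "Customer ID"].tolist()

print(counts["lost"], counts["remained"], counts["new"], lost_ids)
```

The merge keeps the customer IDs alongside their status, which can be handy if you want to inspect who was lost rather than only count them.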
If you only need a single comparison, I would use Python sets:
c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')
Output:
lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138
Building on @Corralien's crosstab suggestion:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int).diff(axis=1)
.replace({0: 'remained', 1: 'new', -1: 'lost'})
.apply(pd.Series.value_counts)
)
Output:
OrderYear 2015 2016 2017 2018
lost NaN 163 123 83
new NaN 141 191 138
remained NaN 489 479 572
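One thing to keep in mind when reading this table: a `diff` of 0 means "no change", which covers customers who ordered in both years and customers who ordered in neither. That would account for `remained` being 572 for 2018 here while the set-based count of 2017 customers still present in 2018 is 552. A small sketch on a made-up crosstab (the toy values and the names `retained` and `inactive` are assumptions for illustration) separates the two cases:

```python
import pandas as pd

# Toy crosstab (hypothetical): number of orders per customer and year.
out = pd.DataFrame(
    {2017: [2, 1, 0, 0], 2018: [3, 0, 4, 0]},
    index=pd.Index(["A", "B", "C", "D"], name="Customer ID"),
)
out.columns.name = "OrderYear"

active = out.gt(0)                         # True if the customer ordered that year
change = active.astype(int).diff(axis=1)   # 1 = new, -1 = lost, 0 = no change

# diff == 0 mixes two cases: active in both years, or active in neither.
unchanged = change[2018].eq(0)
retained = unchanged & active[2018]        # ordered in 2017 and in 2018
inactive = unchanged & ~active[2018]       # ordered in neither year

print(int(unchanged.sum()), int(retained.sum()), int(inactive.sum()))
```

If you need a strict "retained" figure, counting `retained` rather than every zero in the `diff` keeps the crosstab result in line with the set intersection.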
For further analysis, you can use pd.crosstab:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
At this point, your dataframe looks like this:
>>> out
OrderYear 2015 2016 2017 2018
Customer ID
AA-10315 4 1 4 2
AA-10375 2 4 4 5
AA-10480 1 0 10 1
AA-10645 6 3 8 1
AB-10015 4 0 2 0 # <- lost customer
... ... ... ... ...
XP-21865 10 3 9 6
YC-21895 3 1 3 1
YS-21880 0 5 0 7
ZC-21910 5 9 9 8
ZD-21925 3 0 5 1
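The values are the number of orders per customer and year.
Now it is easy to get the "lost customers", i.e. those with at least one order in 2017 and none in 2018: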
>>> sum((out[2017] != 0) & (out[2018] == 0))
83
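The same boolean mask can also return the IDs of the lost customers, not just their count. Here is a minimal, self-contained sketch on made-up toy data (the `orders` frame is hypothetical and stands in for the real `df`):

```python
import pandas as pd

# Toy order data (hypothetical, not the sales.csv from the question).
orders = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C", "A", "C", "C", "D"],
    "OrderYear":   [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
})

# Rows: customers, columns: years, values: number of orders.
out = pd.crosstab(orders["Customer ID"], orders["OrderYear"])

# Ordered in 2017 but not in 2018.
lost_mask = (out[2017] != 0) & (out[2018] == 0)

n_lost = int(lost_mask.sum())
lost_ids = lost_mask[lost_mask].index.tolist()

print(n_lost, lost_ids)
```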