Calculate Euclidean distance between groups in a data frame
I have weekly data for several stores in the following form:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Store': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S3', 'S3'], 'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42], 'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})
Store Week Sales Cust_count
0 S1 1 20 2
1 S1 2 30 4
2 S1 3 40 6
3 S2 1 21 3
4 S2 2 31 5
5 S2 3 41 7
6 S3 1 22 4
7 S3 2 32 6
8 S3 3 42 8
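As you can see, the data is at the store-week level. I want to calculate the Euclidean distance between every pair of stores within the same week, and then take the average of those distances. For example, the calculation for stores S1 and S2 looks like this: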
For week 1: sqrt((20-21)^2 + (2-3)^2) = sqrt(2)
For week 2: sqrt((30-31)^2 + (4-5)^2) = sqrt(2)
For week 3: sqrt((40-41)^2 + (6-7)^2) = sqrt(2)
The final value for distance between S1 and S2 = sqrt(2) which is calculated as
average distance of the 3 weeks i.e. (3 * sqrt(2)) / 3
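The arithmetic above can be checked with a few lines of numpy. This is only a sketch of the worked example for S1 and S2; the array names `s1`, `s2`, `weekly` and `mean_dist` are made up for illustration.

```python
import numpy as np

# Sales and Cust_count for S1 and S2, one row per week (values from the table above)
s1 = np.array([[20, 2], [30, 4], [40, 6]])
s2 = np.array([[21, 3], [31, 5], [41, 7]])

# Euclidean distance within each week, then the mean over the three weeks
weekly = np.sqrt(((s1 - s2) ** 2).sum(axis=1))
mean_dist = weekly.mean()
print(weekly, mean_dist)
```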
The final output should look like this:
     S1        S2        S3
S1   0         1.414     2.828
S2   1.414     0         some val
S3   2.828     some val  0
I have some idea of how to use groupby on the data frame's columns and scipy.spatial.distance.cdist to compute the Euclidean distance, but I can't connect the two into a solution.
We can pivot and then use numpy to do the calculation:
df1 = (df.pivot(index='Store', columns='Week', values=['Sales', 'Cust_count'])
       # .fillna(0)  # Uncomment if you want to treat missing store-weeks as 0s
       )
# One (stores x weeks) array per measure
arr1 = df1['Sales'].to_numpy()
arr2 = df1['Cust_count'].to_numpy()
# Broadcast to (stores, stores, weeks), then average over the week axis
data = np.nanmean(np.sqrt((arr1[None, :] - arr1[:, None])**2
                          + (arr2[None, :] - arr2[:, None])**2),
                  axis=2)
pd.DataFrame(data, index=df1.index, columns=df1.index)
Store S1 S2 S3
Store
S1 0.000000 1.414214 2.828427
S2 1.414214 0.000000 1.414214
S3 2.828427 1.414214 0.000000
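If there are more than two measures, the same broadcasting idea can probably be written once for a list of columns instead of one array per column. The following is a sketch under that assumption; the `features` list and the names `wide`, `cube`, `dist` and `result` are not from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Store': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S3', 'S3'],
                   'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
                   'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

features = ['Sales', 'Cust_count']  # assumed: the columns that define a "point"

wide = df.pivot(index='Store', columns='Week', values=features)

# Stack one (stores x weeks) block per feature into (stores, weeks, features)
cube = np.stack([wide[f].to_numpy(dtype=float) for f in features], axis=-1)

# Pairwise differences: (stores, stores, weeks, features)
diff = cube[:, None, :, :] - cube[None, :, :, :]

# Euclidean norm over features, then mean over weeks (store-weeks missing from df are skipped)
dist = np.sqrt((diff ** 2).sum(axis=-1))
result = pd.DataFrame(np.nanmean(dist, axis=-1), index=wide.index, columns=wide.index)
print(result)
```

The intermediate array grows with stores² × weeks × features, so this is only reasonable for a moderate number of stores.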
Using a for loop over permutations:
import itertools
# All ordered pairs of distinct stores
s = list(itertools.permutations(df.Store.unique(), 2))
l = []
for x in s:
    # Assumes each store's rows are in the same week order
    a = df[df.Store == x[0]].iloc[:, 2:].values
    b = df[df.Store == x[1]].iloc[:, 2:].values
    # Per-week Euclidean distance, then the mean over weeks
    l.append(np.mean(np.sqrt(np.sum((a - b)**2, axis=1))))
s = pd.Series(l, index=pd.MultiIndex.from_tuples(s)).unstack()
s
Out[216]:
S1 S2 S3
S1 NaN 1.414214 2.828427
S2 1.414214 NaN 1.414214
S3 2.828427 1.414214 NaN
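The diagonal is NaN here because permutations never pairs a store with itself. If zeros are wanted on the diagonal, as in the expected output, one option is to fill only the diagonal rather than every NaN. A small sketch, rebuilding the matrix shown above by hand (the names `stores`, `vals` and `filled` are made up):

```python
import numpy as np
import pandas as pd

stores = ['S1', 'S2', 'S3']
# The matrix printed above, typed in by hand for illustration
s = pd.DataFrame([[np.nan, 1.414214, 2.828427],
                  [1.414214, np.nan, 1.414214],
                  [2.828427, 1.414214, np.nan]],
                 index=stores, columns=stores)

# Work on a copy so the original result is left untouched
vals = s.to_numpy(copy=True)
np.fill_diagonal(vals, 0.0)  # a store is at distance 0 from itself
filled = pd.DataFrame(vals, index=s.index, columns=s.columns)
print(filled)
```

Unlike a blanket fillna(0), this leaves any off-diagonal NaN in place.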
You can first merge on Week to get every combination of stores, then compute a dist column with the Euclidean distance, and finally use pivot_table with aggfunc='mean':
df.merge(df, on='Week', how='left', suffixes=('','_'))\
.assign(dist = lambda x: np.sqrt((x.Sales - x.Sales_)**2 + (x.Cust_count - x.Cust_count_)**2))\
.pivot_table(index='Store', columns='Store_', values='dist', aggfunc='mean')
Store_ S1 S2 S3
Store
S1 0.000000 1.414214 2.828427
S2 1.414214 0.000000 1.414214
S3 2.828427 1.414214 0.000000
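The groupby plus scipy.spatial.distance.cdist idea from the question can also be connected directly: compute one pairwise distance matrix per week, then average the matrices. This is a sketch, not a tested general solution; it assumes every store has exactly one row in every week (the `.loc` lookup raises a KeyError otherwise), and the names `stores`, `weekly` and `result` are made up.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'Store': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S3', 'S3'],
                   'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
                   'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

stores = sorted(df['Store'].unique())

weekly = []
for _, g in df.groupby('Week'):
    # Put this week's rows in a fixed store order so the matrices line up
    x = g.set_index('Store').loc[stores, ['Sales', 'Cust_count']].to_numpy(dtype=float)
    # Pairwise Euclidean distances between all stores for this week
    weekly.append(cdist(x, x, metric='euclidean'))

# Average the per-week matrices element-wise
result = pd.DataFrame(np.mean(np.stack(weekly), axis=0), index=stores, columns=stores)
print(result)
```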