Compute edit distance between the rows of a single-column DataFrame in Python

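I have a DataFrame with a single column and more than 2,000 rows. How can I compute the edit distance between the rows of that column?

My DataFrame looks like this: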

  Name
  John
  Mrinmayee
  rituja
  ritz
  divya
  priyanka
  chetna
  chetan
  mansi
  mansvi
  mani
  aliya
  shelia
  Dilip
  Dilipa

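I need the distance between every pair of rows. How can this be done?

I have written some code, but it does not work: it prints an endless list of distances. I think I made a mistake in the for loop. Can anyone help?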

   import pandas as pd
   import numpy as np
   import editdistance
   data_dist = pd.read_csv('Data_TestDescription.csv')
   df = pd.DataFrame(data_dist)
   levdist =[]
   for index, row in df.iterrows():
        levdist = editdistance.eval(row,row)
        print(levdist)

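Here is a neat trick I picked up: use itertools.product to generate every ordered pair of names, then compute the edit distance for each pair in a loop.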

from itertools import product

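# One slot per ordered pair of names; np.empty leaves the values uninitialized.
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.Name, repeat=2)):
    dist[i] = editdistance.eval(*x)  # x is a (name_a, name_b) tuple

# Reshape the flat array into an n x n distance matrix.
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))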

dist_df

    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
0    0   8   6   4   5   7   5   5   5   6   4   5   6   5   6
1    8   0   7   7   7   6   8   8   7   8   7   7   8   8   8
2    6   7   0   3   4   5   5   6   6   6   6   5   5   5   4
3    4   7   3   0   4   6   5   5   5   6   4   4   6   4   5
4    5   7   4   4   0   6   5   5   5   6   5   3   5   4   4
5    7   6   5   6   6   0   6   6   6   7   6   5   7   7   6
6    5   8   5   5   5   6   0   2   6   6   5   5   3   6   5
7    5   8   6   5   5   6   2   0   6   6   5   5   4   6   6
8    5   7   6   5   5   6   6   6   0   1   1   5   5   5   6
9    6   8   6   6   6   7   6   6   1   0   2   5   6   6   6
10   4   7   6   4   5   6   5   5   1   2   0   4   5   4   5
11   5   7   5   4   3   5   5   5   5   5   4   0   4   4   3
12   6   8   5   6   5   7   3   4   5   6   5   4   0   4   4
13   5   8   5   4   4   7   6   6   5   6   4   4   4   0   1
14   6   8   4   5   4   6   5   6   6   6   5   3   4   1   0

np.empty allocates an uninitialized array of the required size, which is then filled in one element at a time by the calls to editdistance.eval.
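
To make the pair ordering and the reshape step concrete, here is a minimal sketch of the same loop using only the standard library. The levenshtein helper is a hypothetical pure-Python stand-in for editdistance.eval (an assumption so the snippet runs without extra packages), and plain lists replace the NumPy array and the DataFrame.

```python
from itertools import product


def levenshtein(a, b):
    """Pure-Python edit distance; a hypothetical stand-in for editdistance.eval."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# A few names taken from the question's sample data.
names = ["chetna", "chetan", "mansi", "mansvi", "mani"]
n = len(names)

# product(names, repeat=2) yields pairs in row-major order:
# (names[0], names[0]), (names[0], names[1]), ..., (names[n-1], names[n-1])
flat = [levenshtein(x, y) for x, y in product(names, repeat=2)]

# Cut the flat list into n rows of n entries, like dist.reshape(-1, n).
matrix = [flat[r * n:(r + 1) * n] for r in range(n)]

for name, row in zip(names, matrix):
    print(f"{name:<8}", row)
```

Because the pairs arrive in row-major order, entry matrix[i][j] is the distance between names[i] and names[j], which is why a plain reshape is enough in the NumPy version.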


Borrowing senderle's cartesian_product, we can get some speedup:

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

v = np.apply_along_axis(func1d=lambda x: editdistance.eval(*x), 
           arr=cartesian_product(df.Name, df.Name), axis=1).reshape(-1, df.shape[0])

dist_df = pd.DataFrame(v)

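Alternatively, you can define a function that computes the edit distance and wrap it with np.vectorize (a convenience wrapper that still loops in Python under the hood):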

def f(x, y):
    return editdistance.eval(x, y)

v = np.vectorize(f)

arr = cartesian_product(df.Name, df.Name).T
arr = v(arr[0, :], arr[1, :])

dist_df = pd.DataFrame(arr.reshape(-1, df.shape[0]))
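
The output above is symmetric with a zero diagonal, so another possible saving is to evaluate each unordered pair only once and mirror the result. The following is a rough sketch of that idea, not part of the approaches above. It assumes a hypothetical pure-Python levenshtein helper in place of editdistance.eval and uses plain lists instead of NumPy so that it runs on its own.

```python
from itertools import combinations


def levenshtein(a, b):
    """Pure-Python edit distance; a hypothetical stand-in for editdistance.eval."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# A few names taken from the question's sample data.
names = ["John", "ritz", "Dilip", "Dilipa"]
n = len(names)

# Start from all zeros, so the diagonal (distance to itself) is already correct.
sym = [[0] * n for _ in range(n)]

# combinations(...) yields each unordered pair exactly once, with i < j.
for (i, x), (j, y) in combinations(enumerate(names), 2):
    d = levenshtein(x, y)
    sym[i][j] = d  # upper triangle
    sym[j][i] = d  # mirrored into the lower triangle

for name, row in zip(names, sym):
    print(f"{name:<8}", row)
```

For n names this makes n(n-1)/2 distance calls instead of n², which is roughly half the work for a column of 2,000 rows.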

If you need a labeled index and columns, pass them when constructing dist_df:

dist_df = pd.DataFrame(..., index=df.Name, columns=df.Name)

dist_df

Name       John  Mrinmayee  rituja  ritz  divya  priyanka  chetna  chetan  \
Name                                                                        
John          0          8       6     4      5         7       5       5   
Mrinmayee     8          0       7     7      7         6       8       8   
rituja        6          7       0     3      4         5       5       6   
ritz          4          7       3     0      4         6       5       5   
divya         5          7       4     4      0         6       5       5   
priyanka      7          6       5     6      6         0       6       6   
chetna        5          8       5     5      5         6       0       2   
chetan        5          8       6     5      5         6       2       0   
mansi         5          7       6     5      5         6       6       6   
mansvi        6          8       6     6      6         7       6       6   
mani          4          7       6     4      5         6       5       5   
aliya         5          7       5     4      3         5       5       5   
shelia        6          8       5     6      5         7       3       4   
Dilip         5          8       5     4      4         7       6       6   
Dilipa        6          8       4     5      4         6       5       6   

Name       mansi  mansvi  mani  aliya  shelia  Dilip  Dilipa  
Name                                                          
John           5       6     4      5       6      5       6  
Mrinmayee      7       8     7      7       8      8       8  
rituja         6       6     6      5       5      5       4  
ritz           5       6     4      4       6      4       5  
divya          5       6     5      3       5      4       4  
priyanka       6       7     6      5       7      7       6  
chetna         6       6     5      5       3      6       5  
chetan         6       6     5      5       4      6       6  
mansi          0       1     1      5       5      5       6  
mansvi         1       0     2      5       6      6       6  
mani           1       2     0      4       5      4       5  
aliya          5       5     4      0       4      4       3  
shelia         5       6     5      4       0      4       4  
Dilip          5       6     4      4       4      0       1  
Dilipa         6       6     5      3       4      1       0