Compute edit distance for a DataFrame that has only one column and multiple rows in Python
I have a DataFrame with a single column and more than 2000 rows. How can I compute the edit distance between the rows of that column?
My DataFrame looks like this:
Name
John
Mrinmayee
rituja
ritz
divya
priyanka
chetna
chetan
mansi
mansvi
mani
aliya
shelia
Dilip
Dilipa
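I need to compute the distance between every pair of rows. How can this be done?
I have written some code, but it does not work: it prints an endless list of distances, and I think I made a mistake in the for loop. Can someone help?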
import pandas as pd
import numpy as np
import editdistance
data_dist = pd.read_csv('Data_TestDescription.csv')
df = pd.DataFrame(data_dist)
levdist = []
for index, row in df.iterrows():
    levdist = editdistance.eval(row, row)
    print(levdist)
Here is a neat trick I picked up: use itertools.product to generate every pair of names, then compute the edit distance in a loop:
from itertools import product
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.Name, repeat=2)):
    dist[i] = editdistance.eval(*x)
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
dist_df
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 0 8 6 4 5 7 5 5 5 6 4 5 6 5 6
1 8 0 7 7 7 6 8 8 7 8 7 7 8 8 8
2 6 7 0 3 4 5 5 6 6 6 6 5 5 5 4
3 4 7 3 0 4 6 5 5 5 6 4 4 6 4 5
4 5 7 4 4 0 6 5 5 5 6 5 3 5 4 4
5 7 6 5 6 6 0 6 6 6 7 6 5 7 7 6
6 5 8 5 5 5 6 0 2 6 6 5 5 3 6 5
7 5 8 6 5 5 6 2 0 6 6 5 5 4 6 6
8 5 7 6 5 5 6 6 6 0 1 1 5 5 5 6
9 6 8 6 6 6 7 6 6 1 0 2 5 6 6 6
10 4 7 6 4 5 6 5 5 1 2 0 4 5 4 5
11 5 7 5 4 3 5 5 5 5 5 4 0 4 4 3
12 6 8 5 6 5 7 3 4 5 6 5 4 0 4 4
13 5 8 5 4 4 7 6 6 5 6 4 4 4 0 1
14 6 8 4 5 4 6 5 6 6 6 5 3 4 1 0
np.empty allocates an uninitialized array, which is then filled in with the result of each editdistance.eval call.
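To make the fill-and-reshape step concrete, here is a minimal sketch that needs no third-party packages. The levenshtein function is my own stand-in for editdistance.eval (an assumption, written as the textbook dynamic-programming recurrence), and a plain list stands in for the NumPy array; the three names are taken from the sample data.

```python
from itertools import product


def levenshtein(a, b):
    """Stand-in for editdistance.eval: textbook dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


names = ["mansi", "mansvi", "mani"]  # a small slice of the Name column
n = len(names)

# Preallocate a flat result of length n * n, then fill it pair by pair.
flat = [0] * (n * n)
for i, (x, y) in enumerate(product(names, repeat=2)):
    flat[i] = levenshtein(x, y)

# product() varies the second name fastest, so flat position i belongs to
# row i // n and column i % n; slicing every n items rebuilds the matrix.
matrix = [flat[r * n:(r + 1) * n] for r in range(n)]
print(matrix)  # [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
```

The values agree with rows and columns 8 to 10 of the table above, which is a quick sanity check that the reshape puts each pair where you expect it.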
Borrowing senderle's cartesian_product, we can get some speed improvement:
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
v = np.apply_along_axis(func1d=lambda x: editdistance.eval(*x),
                        arr=cartesian_product(df.Name, df.Name), axis=1).reshape(-1, df.shape[0])
dist_df = pd.DataFrame(v)
Alternatively, you can define a function that computes the edit distance and vectorize it:
def f(x, y):
    return editdistance.eval(x, y)
v = np.vectorize(f)
arr = cartesian_product(df.Name, df.Name).T
arr = v(arr[0, :], arr[1, :])
dist_df = pd.DataFrame(arr.reshape(-1, df.shape[0]))
If you need a labelled index and columns, add them when constructing dist_df:
dist_df = pd.DataFrame(..., index=df.Name, columns=df.Name)
dist_df
Name John Mrinmayee rituja ritz divya priyanka chetna chetan \
Name
John 0 8 6 4 5 7 5 5
Mrinmayee 8 0 7 7 7 6 8 8
rituja 6 7 0 3 4 5 5 6
ritz 4 7 3 0 4 6 5 5
divya 5 7 4 4 0 6 5 5
priyanka 7 6 5 6 6 0 6 6
chetna 5 8 5 5 5 6 0 2
chetan 5 8 6 5 5 6 2 0
mansi 5 7 6 5 5 6 6 6
mansvi 6 8 6 6 6 7 6 6
mani 4 7 6 4 5 6 5 5
aliya 5 7 5 4 3 5 5 5
shelia 6 8 5 6 5 7 3 4
Dilip 5 8 5 4 4 7 6 6
Dilipa 6 8 4 5 4 6 5 6
Name mansi mansvi mani aliya shelia Dilip Dilipa
Name
John 5 6 4 5 6 5 6
Mrinmayee 7 8 7 7 8 8 8
rituja 6 6 6 5 5 5 4
ritz 5 6 4 4 6 4 5
divya 5 6 5 3 5 4 4
priyanka 6 7 6 5 7 7 6
chetna 6 6 5 5 3 6 5
chetan 6 6 5 5 4 6 6
mansi 0 1 1 5 5 5 6
mansvi 1 0 2 5 6 6 6
mani 1 2 0 4 5 4 5
aliya 5 5 4 0 4 4 3
shelia 5 6 5 4 0 4 4
Dilip 5 6 4 4 4 0 1
Dilipa 6 6 5 3 4 1 0
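One more thought for the 2000-row case in the question: the full product evaluates 2000 * 2000 = 4,000,000 pairs, but the matrix above is symmetric with zeros on the diagonal, so only the 1,999,000 pairs above the diagonal are distinct. A rough sketch of that idea follows. pairwise_matrix and lev are hypothetical names of my own, and lev is a short recursive stand-in for editdistance.eval so that the block runs on its own.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def lev(a, b):
    """Stand-in for editdistance.eval: memoized recursive edit distance."""
    if not a or not b:
        return len(a) + len(b)  # only insertions or deletions remain
    return min(lev(a[1:], b) + 1,                     # delete a[0]
               lev(a, b[1:]) + 1,                     # insert b[0]
               lev(a[1:], b[1:]) + (a[0] != b[0]))    # substitute or match


def pairwise_matrix(items, distance):
    """Fill a symmetric matrix by calling distance once per unordered pair."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]  # diagonal stays 0
    calls = 0
    for (i, x), (j, y) in combinations(enumerate(items), 2):
        matrix[i][j] = matrix[j][i] = distance(x, y)  # mirror across the diagonal
        calls += 1
    return matrix, calls


names = ["Dilip", "Dilipa", "chetna", "chetan"]  # taken from the sample data
matrix, calls = pairwise_matrix(names, lev)
print(calls)         # 6 distance calls instead of 16 for the full product
print(matrix[0][1])  # 1, Dilip vs Dilipa
print(matrix[2][3])  # 2, chetna vs chetan
```

In practice you would pass editdistance.eval as distance and list(df.Name) as items; the nested list can then go into pd.DataFrame with index and columns set to the names, as shown above.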