How to calculate Levenshtein ratio/distance for rows in my column in Python?
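I have a data frame with a single column that has 1000 rows.
I need to compare every row with every other row and find the Levenshtein distance for each pair. How can I calculate the ratio or distance in Python?
My data frame looks like this: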
#Df
StepDescription
click confirm button when done
you have logged on
please log in to proceed
click on confirm button
Dolb was released successfully
Enter your details
validate the statement
Aval was released sucessfully
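How can I calculate the edit ratio for all of these?
I have written code that loops over the rows, but I don't know how to proceed after the iteration.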
import Levenshtein
import pandas as pd
# raw string so the backslash in the Windows path is not read as an escape
df = pd.read_csv(r'path\Data_TestDescription.csv')
for index, row in df.iterrows():
    ...  # how do I continue from here?
As asked in the comments, a percentage is needed. I'll keep the accepted answer as it is and only add the new part:
import numpy as np
import pandas as pd
from Levenshtein import distance
from itertools import product
#df = ...
# distance for every ordered pair of rows, including each row with itself
dist = [distance(*x) for x in product(df.StepDescription, repeat=2)]
# reshape the flat list into an n x n matrix
dist_df = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]))
dist_df
    0   1   2   3   4   5   6   7
0   0  23  23  13  29  25  25  28
1  23   0  18  18  23  18  18  23
2  23  18   0  20  25  21  19  24
3  13  18  20   0  27  19  21  26
4  29  23  25  27   0  26  23   5
5  25  18  21  19  26   0  19  25
6  25  18  19  21  23  19   0  21
7  28  23  24  26   5  25  21   0
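To show what `distance` is counting, here is a minimal pure-Python sketch of the usual dynamic-programming edit distance. It is not the `Levenshtein` package's implementation (which is compiled and much faster). The names `levenshtein`, `distance_matrix` and `steps` are made up for this illustration. Because the distance is symmetric, the sketch computes each unordered pair once and mirrors it, which roughly halves the work compared with `product(..., repeat=2)`; for 1000 rows that may matter.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


def distance_matrix(strings):
    """Build a symmetric matrix, computing each unordered pair only once."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = levenshtein(strings[i], strings[j])
    return matrix


# a few rows from the question's StepDescription column
steps = [
    "click confirm button when done",
    "click on confirm button",
    "Dolb was released successfully",
    "Aval was released sucessfully",
]
for row in distance_matrix(steps):
    print(row)
```

The first two strings are 13 edits apart and the last two are 5 apart, which agrees with the corresponding cells of the matrix above.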
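# express each distance as a percentage of the smallest non-zero distance (5 here)
dist_df_percentage = dist_df * 100 // min(x for x in dist if x > 0)
dist_df_percentage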
     0    1    2    3    4    5    6    7
0    0  460  460  260  580  500  500  560
1  460    0  360  360  460  360  360  460
2  460  360    0  400  500  420  380  480
3  260  360  400    0  540  380  420  520
4  580  460  500  540    0  520  460  100
5  500  360  420  380  520    0  380  500
6  500  360  380  420  460  380    0  420
7  560  460  480  520  100  500  420    0
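The values above are relative to the smallest non-zero distance, so they are not bounded by 100. If a similarity between 0 and 100 is wanted instead, one common option is to normalise each distance by the length of the longer string. The sketch below assumes that normalisation; it is not necessarily the formula behind `Levenshtein.ratio` or `fuzz.ratio`, and `similarity_percent` is a hypothetical helper name.

```python
def similarity_percent(distance: int, a: str, b: str) -> float:
    """Map an edit distance to a 0-100 similarity (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - distance / longest)


a = "click confirm button when done"
b = "click on confirm button"
# 13 is the distance between rows 0 and 3 in the distance matrix above
print(round(similarity_percent(13, a, b), 1))  # 56.7
```

The longer string has 30 characters, so 13 edits leave 17/30 of it unchanged, which is about 56.7%.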
Finally, after trying many examples, I got an accurate ratio (percentage) using fuzz.ratio:
from itertools import product
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
# df = ...  (same data frame as above)
# fuzz.ratio returns an integer similarity score from 0 to 100
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.StepDescription, repeat=2)):
    dist[i] = fuzz.ratio(*x)
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
# write the similarity matrix to a tab-separated file
dist_df.to_csv('FuzzyRatio.csv', sep='\t')
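If installing fuzzywuzzy is not an option, a similar 0 to 100 score can be sketched with the standard library's `difflib.SequenceMatcher`. This is a stand-in for illustration and is not guaranteed to give the same numbers as `fuzz.ratio`. The helper names `ratio_percent` and `ratio_matrix` and the sample strings are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio_percent(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    # autojunk=False turns off difflib's heuristic for very frequent characters
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return round(100 * matcher.ratio())


def ratio_matrix(strings):
    """Score every ordered pair and return the result as a list of rows."""
    n = len(strings)
    flat = [ratio_percent(a, b) for a, b in product(strings, repeat=2)]
    return [flat[i * n:(i + 1) * n] for i in range(n)]


# short sample strings chosen only for this example
samples = ["abcd", "bcde", "wxyz"]
for row in ratio_matrix(samples):
    print(row)
```

`SequenceMatcher.ratio()` is defined as 2*M/T, where M is the number of matching characters and T is the combined length, so "abcd" and "bcde" (3 matches out of 8 characters) score 75.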