Compute number of word distance of 1 for every word in pandas column

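For every string in a list, I need to find the number of strings in that list that are exactly one edit away. Edit distance is the minimum number of character substitutions, insertions, or deletions needed to derive one word from another. To illustrate, consider the following DataFrame: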

import pandas as pd
import numpy as np
df = pd.DataFrame({
'word':['can', 'cans', 'canse', 'canpe', 'canp', 'camp'],
'code':['k@n', 'k@n}', 'k@(z', np.nan, 'k@()', np.nan]})

  word  code
0 can    k@n
1 cans  k@n}
2 canse k@(z
3 canpe
4 canp  k@()
5 camp 
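
To make the definition concrete, here is a minimal pure-Python sketch of the edit distance. It is only an illustration; `edit_distance` is a hypothetical helper name and not the function from the Levenshtein package used in the rest of the post.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions,
    or deletions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("can", "cans"))   # 1: one insertion
print(edit_distance("canp", "camp"))  # 1: one substitution
print(edit_distance("can", "camp"))   # 2: one substitution plus one insertion
```

So 'can' and 'cans' count as neighbors, while 'can' and 'camp' do not.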

My current implementation is far too slow:

from Levenshtein import distance as lev

df = df.fillna('')

# get unique strings
wordAll = df['word'].dropna().to_list()
codeAll = list(set(df['code'].dropna().to_list()))

# prepare dataframe for storage
df['wordLev'] = np.nan
df['codeLev'] = np.nan

# find neighbors
for idx,row in df.iterrows():
    i=0
    j=0

    # get word and code
    word = row['word']
    code = row['code']

    # remove word and code from all-strings-list
    wordSubset = [w for w in wordAll if w != word]
    codeSubset = [c for c in codeAll if c != code]

    # compute number of neighbors
    for item in wordSubset:
        if lev(word, item) == 1:
            i += 1
    for item in codeSubset:
        if lev(code, item) == 1:
            j += 1

    # add number of neighbors to df
    df.loc[df['word'] == word, 'wordLev'] = i
    if code != '':
        df.loc[df['code'] == code, 'codeLev'] = j
    else:
        df.loc[df['code'] == code, 'codeLev'] = ''

df

  word  code wordLev codeLev
0 can    k@n       2       1
1 cans  k@n}       3       1
2 canse k@(z       2       1
3 canpe            2
4 canp  k@()       4       1
5 camp             1

How can I speed this up? The DataFrame has roughly 500,000 rows...

The following code seems to be about 5 times faster than yours: 1.8 ms versus 9.6 ms (at least on the df you provided).

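from Levenshtein import distance as lev
df = df.fillna('')
# per word: number of entries in the column at edit distance 1
df['wordLev'] = [sum(1 for item in df['word'] if item!=word and lev(word, item)==1) for word in df['word']]
# per code: same count; `or ''` leaves an empty string where the count is 0
df['codeLev'] = [sum(1 for item in df['code'] if item!=code and lev(code, item)==1) or '' for code in df['code']]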

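This code is very similar to yours. The biggest difference is that, instead of creating wordSubset and codeSubset and then iterating over them again to apply the Levenshtein distance function, it does everything in a single generator expression. Since you are checking every word in the column against every other word, I don't think you can avoid the double loop.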
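
That said, if only neighbors at distance exactly 1 are needed, it may be possible to skip the pairwise comparison by indexing every word under the strings you get when one character is removed. Below is a rough sketch of that idea. I have not benchmarked it on data of your size. `count_distance_one_neighbors` is a hypothetical helper name, and the sketch assumes you want to count distinct neighboring strings, which differs from the loops above when the column contains duplicates.

```python
from collections import defaultdict


def count_distance_one_neighbors(words):
    """Map each distinct string to the number of distinct strings
    in `words` that are exactly one edit away from it."""
    vocab = set(words)
    same_length = defaultdict(set)  # (position, word without that char) -> words
    longer = defaultdict(set)       # word without one char -> the longer words

    for w in vocab:
        for i in range(len(w)):
            rest = w[:i] + w[i + 1:]
            same_length[(i, rest)].add(w)
            longer[rest].add(w)

    counts = {}
    for w in vocab:
        neighbors = set()
        for i in range(len(w)):
            rest = w[:i] + w[i + 1:]
            # same length, differing only at position i (substitution)
            neighbors |= same_length[(i, rest)]
            # one character shorter (deletion)
            if rest in vocab:
                neighbors.add(rest)
        # one character longer (insertion)
        neighbors |= longer.get(w, set())
        neighbors.discard(w)  # a word is not its own neighbor
        counts[w] = len(neighbors)
    return counts


words = ['can', 'cans', 'canse', 'canpe', 'canp', 'camp']
counts = count_distance_one_neighbors(words)
print(counts)
```

On the example words this gives 2, 3, 2, 2, 4 and 1 neighbors for can, cans, canse, canpe, canp and camp (canp is one edit away from can, cans, canpe and camp). The resulting dict could then be mapped back onto the column, for example with df['word'].map(counts). The work grows with the total number of characters rather than with the number of word pairs, at the cost of keeping the two index dicts in memory.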