Compute the number of words at edit distance 1 for every word in a pandas column
For every string in a list, I need to find the number of strings in that list that are one edit distance away. Edit distance is the minimum number of character substitutions, additions or deletions needed to derive one word from another. To illustrate, consider the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'word':['can', 'cans', 'canse', 'canpe', 'canp', 'camp'],
'code':['k@n', 'k@n}', 'k@(z', np.nan, 'k@()', np.nan]})
    word  code
0    can   k@n
1   cans  k@n}
2  canse  k@(z
3  canpe
4   canp  k@()
5   camp
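For example, with the same Levenshtein.distance function used in the code below, 'can' has exactly two neighbors at edit distance 1 in the word column ('cans' and 'canp'), while 'canse' and 'camp' are two edits away:
from Levenshtein import distance as lev
lev('can', 'cans')   # 1 -> neighbor (one insertion)
lev('can', 'canp')   # 1 -> neighbor (one insertion)
lev('can', 'canse')  # 2 -> not a neighbor
lev('can', 'camp')   # 2 -> not a neighbor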
My current implementation is too slow:
from Levenshtein import distance as lev
df = df.fillna('')
# get all words and the unique codes
wordAll = df['word'].dropna().to_list()
codeAll = list(set(df['code'].dropna().to_list()))
# prepare dataframe for storage
df['wordLev'] = np.nan
df['codeLev'] = np.nan
# find neighbors
for idx, row in df.iterrows():
    i = 0
    j = 0
    # get word and code
    word = row['word']
    code = row['code']
    # remove word and code from the all-strings lists
    wordSubset = [w for w in wordAll if w != word]
    codeSubset = [c for c in codeAll if c != code]
    # compute number of neighbors
    for item in wordSubset:
        if lev(word, item) == 1:
            i += 1
    for item in codeSubset:
        if lev(code, item) == 1:
            j += 1
    # add number of neighbors to df
    df.loc[df['word'] == word, 'wordLev'] = i
    if code != '':
        df.loc[df['code'] == code, 'codeLev'] = j
    else:
        df.loc[df['code'] == code, 'codeLev'] = ''
df
    word  code  wordLev codeLev
0    can   k@n        2       1
1   cans  k@n}        3       1
2  canse  k@(z        2       1
3  canpe              2
4   canp  k@()        4       1
5   camp              1
How can I speed this up? The DataFrame has about 500,000 rows...
The following code appears to be about 5x faster than yours: 1.8 ms vs. 9.6 ms (at least on the df you provided).
df = df.fillna('')
df['wordLev'] = [sum(1 for item in df['word'] if item!=word and lev(word, item)==1) for word in df['word']]
df['codeLev'] = [sum(1 for item in df['code'] if item!=code and lev(code, item)==1) or '' for code in df['code']]
This code is very similar to yours. The biggest difference is that, instead of creating wordSubset or codeSubset and then iterating over them again to apply the Levenshtein distance function, it does everything in a single generator expression. Since you have to check every word in the column against every other word, you can't avoid the double loop, imo.
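A further variation, if your real 500k-row columns contain many repeated strings (just a sketch, not tested at that size): run the same double loop only over the unique values and map the counts back onto the column. Identical strings are at distance 0 from each other, so duplicates never count as neighbors of themselves, but they do multiply the count for strings one edit away. The count_neighbors helper below is hypothetical, not from either snippet above:
from Levenshtein import distance as lev

def count_neighbors(col):
    # col: a pandas Series of strings with NaN already filled with ''
    counts = col.value_counts()            # unique string -> number of occurrences
    uniques = counts.index.to_list()
    # double loop, but only over the unique strings
    neighbors = {
        w: sum(counts[v] for v in uniques if v != w and lev(w, v) == 1)
        for w in uniques
    }
    return col.map(neighbors)

df = df.fillna('')
df['wordLev'] = count_neighbors(df['word'])
# keep the rows with an empty code blank, as in your original output
df['codeLev'] = count_neighbors(df['code']).where(df['code'] != '', '')
On the sample df this gives the same numbers as the list comprehensions above; whether it is actually faster depends entirely on how many duplicates the real columns contain.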