跳过对自身的迭代(列表)

Skip iterating on itself (List)

我正在尝试在列表中查找类似的电子邮件。为此,

database.head()

    TID PID Names
0   22330   134575  tim
1   22333   134578  tim.rand
2   22328   134571  rand.001
3   22340   134568  pankit090
4   22325   134569  timrook

emailsdb = database['Names'].values.tolist() 现在迭代部分

list = []
for email in emailsdb :
    result = process.extractBests(email, emailsdb, score_cutoff=85, limit=100)
    list.append(result)

列表输出为

[[('tim', 100), ('tim.rand', 90), ('timrook', 90)],
 [('tim.rand', 100), ('tim', 90)],
 [('rand.001', 100)],
 [('pankit090', 100),
  ('pankit001', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 [('timrook', 100), ('tim', 90)],
 [('pankit001', 100),
  ('pankit090', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 [('pankit002', 100),
  ('pankit090', 89),
  ('pankit001', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],

但我想避免 ('tim', 100), ('tim.rand', 100), ('rand.001', 100), ('pankit090', 100), ('timrook', 100), ('pankit001', 100),('pankit002', 100) 因为这些显然是绝配

下面是处理模糊查找然后删除原始电子邮件列表结果的代码。由于这是一个短列表,删除匹配列表与清除原始列表具有相同的效果。请注意,删除过程可能包含在第一个电子邮件循环中,但为了清楚起见,我将它们分开。我添加了 zzzz 条目来检查得分。

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# original email list
emailsdb = [
'tim',
'tim.rand',
'rand.001',
'pankit090',
'timrook',
'pankit001',
'pankit002',
'pankit003',
'pankit004',
'pankit005',
'zzzz'
]

print('emailsdb:', emailsdb)

# do fuzzy search and score other emails (including self)
lstres = []
for email in emailsdb :
    result = process.extractBests(email, emailsdb, score_cutoff=85, limit=50)
    lstres.append(result)
    
print('WuzzyResult:', lstres)

# scan match list and create set 
setremove = set()  # no duplicates
for r in lstres:
   setremove |= ({t[0] for t in r})
   
print('setremove:', setremove)

# remove matches from original set
for e in setremove:
   emailsdb.remove(e)
   
print('emailsdb:', emailsdb)

输出:

emailsdb: ['tim', 'tim.rand', 'rand.001', 'pankit090', 'timrook', 'pankit001', 'pankit002', 'pankit003', 'pankit004', 'pankit005', 'zzzz']

WuzzyResult: [
[('tim', 100), ('tim.rand', 90), ('timrook', 90)], 
[('tim.rand', 100), ('tim', 90)], [('rand.001', 100)], 
[('pankit090', 100), ('pankit001', 89), ('pankit002', 89), ('pankit003', 89), ('pankit004', 89), ('pankit005', 89)], 
[('timrook', 100), ('tim', 90)], 
[('pankit001', 100), ('pankit090', 89), ('pankit002', 89), ('pankit003', 89), ('pankit004', 89), ('pankit005', 89)], 
[('pankit002', 100), ('pankit090', 89), ('pankit001', 89), ('pankit003', 89), ('pankit004', 89), ('pankit005', 89)], 
[('pankit003', 100), ('pankit090', 89), ('pankit001', 89), ('pankit002', 89), ('pankit004', 89), ('pankit005', 89)], 
[('pankit004', 100), ('pankit090', 89), ('pankit001', 89), ('pankit002', 89), ('pankit003', 89), ('pankit005', 89)], 
[('pankit005', 100), ('pankit090', 89), ('pankit001', 89), ('pankit002', 89), ('pankit003', 89), ('pankit004', 89)], 
[('zzzz', 100)]
]

setremove: {'pankit002', 'pankit005', 'tim', 'rand.001', 'pankit003', 'zzzz', 'pankit090', 'pankit001', 'timrook', 'pankit004', 'tim.rand'}

emailsdb: []

您可以删除不想比较的项目(可能是 pre-defined 知识)。这可以通过弹出这些项目然后比较它们,然后将它们插入回来来完成。

这应该可以解决您的部分问题。我简单地 pop 比较之前列表中的相同元素,然后 insert 它在下一个循环之前返回。检查打印输出以获取更多详细信息 -

import pandas as pd
from fuzzywuzzy import fuzz, process

emailsdb = list(df['Names'])

l = []
for i in range(len(emailsdb)) : 
    email = emailsdb[i] #Popping based on index, rather than comparing email in emailsdb
    emailsdb.pop(i) #Pop the same email string in the list of emails
    print("comparing:",[email],'->',emailsdb)
    result = process.extractBests(email, emailsdb, score_cutoff=85, limit=100)
    l.append(result)
    emailsdb.insert(i, email) #Insert it back

print(l)
comparing: ['tim'] -> ['tim.rand', 'rand.001', 'pankit090', 'timrook']
comparing: ['tim.rand'] -> ['tim', 'rand.001', 'pankit090', 'timrook']
comparing: ['rand.001'] -> ['tim', 'tim.rand', 'pankit090', 'timrook']
comparing: ['pankit090'] -> ['tim', 'tim.rand', 'rand.001', 'timrook']
comparing: ['timrook'] -> ['tim', 'tim.rand', 'rand.001', 'pankit090']

[[('tim.rand', 90), ('timrook', 90)], [('tim', 90)], [], [], [('tim', 90)]]

如您所见,冗余比较不是输出的一部分。

Notice, I am NOT COMPARING the email that is there in the email list to the whole email list for removing it. I am simply removing the 0th, 1st, 2nd ... index email from the email list, utilizing the sequential nature of the iteration to remove the email. This is a significant speedup.

您可以修改此方法以处理您已知相似且无需比较的项目集。

P.S:拜托拜托拜托拜托拜托拜托了,不要使用list作为变量名:)

根据我们在聊天中的讨论,将此作为另一个答案发布,因为它以完全不同的方式解决了这个问题。

接近-

  1. 如果您使用交换函数来获取字符串匹配,那么您实际上只能计算距离矩阵的下三角矩阵,因为 f(x,y) 将与 f(y ,x).

  2. 模糊比不可交换,而 Levenshtein 距离(又名编辑距离,它是模糊匹配的基础)是可交换的。在这里不是获得最高分,而是找到它们之间具有最小 lev 距离的字符串

  3. 因此,首先获取下三角矩阵的索引并迭代它们以计算具有这些索引的电子邮件的 lev 距离。你不要在距离矩阵 f(x,x) 的对角线上迭代,因为它很简单,后来你将它设置为一个像 999 这样的高值,这样你就可以在每一行中找到最小分值。

  4. 这为您提供了距离矩阵(请参阅输出)。接下来,对于每一行(电子邮件),您找到最近的字符串(最小距离)并将其创建为最佳匹配的元组(参见输出)

  5. 现在,这里的问题是,即使字符串的 NONE 匹配得很好,它仍然会贪婪地获得最近的一个。

  6. 所以最后你可以在这些字符串之间取 fuzz.ratio 并得到模糊分数,你可以过滤。所以现在,您可以避免垃圾匹配,只获得真正匹配超过特定阈值的匹配。这给出了最终匹配字符串(见输出)

这应该是您从算法角度优化代码所能做的最好的事情了。

import pandas as pd
import numpy as np

#!pip install python-Levenshtein
from Levenshtein import distance
from fuzzywuzzy import fuzz, process

#CHECKING COMMUTATIVE PROPERTY
distance('abcd','bdca') == distance('bdca', 'abcd')
###OUTPUT - TRUE

fuzz.ratio('abcd','bdca') == fuzz.ratio('bdca', 'abcd')
###OUTPUT - FALSE


#Create dummy matrix
m = np.zeros((len(emailsdb),len(emailsdb)))

#Get lower triangular matrix indices
i,j = np.tril_indices(len(emailsdb))

#iterate over lower triangular and get Levenshtein distance for lower triangular
for k in zip(i,j):
    if k[0]!=k[1]:
        m[k] = distance(emailsdb[k[0]], emailsdb[k[1]])
    else:
        m[k] = 0

#Copy lower triangular to upper traingular thanks to COMMUTATIVE PROPERTY
m = m + m.T

#Filling diagonal with 999 so that same string doesnt show up with minimum distance
np.fill_diagonal(m, 999)
print("distance matrix = ")
print(m)

#Get best matches with minimum distance
matches = list(zip([i for i in emailsdb], [emailsdb[i] for i in np.argmin(m, axis=0)]))

print("Best matches = ")
print(matches)


#OPTIONAL -> Filtering out irrelavant matches based on fuzzy match
f = [(*i,fuzz.ratio(*i)) for i in matches if fuzz.ratio(*i)>50]

print("final matching strings - ")
print(f)
distance matrix = 
[[999.   5.   8.   8.   4.]
 [  5. 999.   8.   9.   4.]
 [  8.   8. 999.   6.   8.]
 [  8.   9.   6. 999.   9.]
 [  4.   4.   8.   9. 999.]]


Best matches = 
[('tim', 'timrook'), ('tim.rand', 'timrook'), ('rand.001', 'pankit090'), ('pankit090', 'rand.001'), ('timrook', 'tim')]


final matching strings - 
[('tim', 'timrook', 60), ('tim.rand', 'timrook', 53), ('timrook', 'tim', 60)]