Similarity score to compare all strings in column to first string using fuzzywuzzy

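I have a dataset of time series holding lists for a large number of objects (units), and I need to compare each of an object's lists with that object's first list. For this I have been using fuzzywuzzy and its similarity methods, but I have not really managed to compare all later instances (lists) with the first instance of each object. To make this easier to understand, let's look at what I have achieved so far. Note: I am new to fuzzywuzzy.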

My dataframe has this form:

import datetime
import pandas as pd

data = {'unit': {59: 'unit1',
  662: 'unit1',
  680: 'unit1',
  725: 'unit1',
  709: 'unit1',
  703: 'unit1',
  653: 'unit1',
  807: 'unit4',
  825: 'unit4',
  778: 'unit4',
  816: 'unit4',
  822: 'unit4',
  849: 'unit4',
  820: 'unit4',
  754: 'unit4',
  1031: 'unit3',
  1094: 'unit2',
  1008: 'unit2',
  1089: 'unit2',
  1044: 'unit5'},
 'Date_job': {59: datetime.date(2021, 6, 7),
  662: datetime.date(2021, 6, 14),
  680: datetime.date(2021, 7, 5),
  725: datetime.date(2021, 7, 26),
  709: datetime.date(2021, 8, 30),
  703: datetime.date(2021, 10, 11),
  653: datetime.date(2021, 10, 18),
  807: datetime.date(2021, 7, 19),
  825: datetime.date(2021, 7, 26),
  778: datetime.date(2021, 8, 23),
  816: datetime.date(2021, 8, 30),
  822: datetime.date(2021, 9, 6),
  849: datetime.date(2021, 9, 27),
  820: datetime.date(2021, 10, 4),
  754: datetime.date(2021, 10, 18),
  1031: datetime.date(2021, 9, 6),
  1094: datetime.date(2021, 7, 26),
  1008: datetime.date(2021, 8, 9),
  1089: datetime.date(2021, 10, 4),
  1044: datetime.date(2021, 6, 14)},
 'Vector': {59: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|2:1/3.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/5.0',
   'B|7:1/5.0'],
  662: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|5:1/8.0',
   'B|6:1/5.0',
   'B|7:1/5.0'],
  680: ['A|14:1/9.0',
   'A|14:1/4.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/5.0',
   'B|7:1/5.0'],
  725: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|2:1/3.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/5.0',
   'B|7:1/5.0'],
  709: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|2:1/3.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/5.0',
   'B|7:1/5.0'],
  703: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|2:1/4.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/6.0',
   'B|7:1/5.0'],
  653: ['A|14:1/9.0',
   'A|15:1/11.0',
   'A|16:1/12.0',
   'B|11:1/4.0',
   'B|2:1/4.0',
   'B|3:1/12.0',
   'B|4:1/12.0',
   'B|5:1/9.0',
   'B|6:1/6.0',
   'B|7:1/5.0'],
  807: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|4:1/2.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  825: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/2.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  778: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0',
   'A|8:1/7.0'],
  816: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/4.0',
   'A|7:1/10.0',
   'A|7:1/10.0',
   'A|8:1/7.0'],
  822: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/2.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/4.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  849: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/3.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  820: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/5.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  754: ['A|10:1/13.0',
   'A|10:1/13.0',
   'A|3:1/6.0',
   'A|3:1/6.0',
   'A|5:1/3.0',
   'A|5:1/2.0',
   'A|6:1/5.0',
   'A|6:1/5.0',
   'A|7:1/10.0',
   'A|7:1/10.0'],
  1031: ['A|10:1/7.0',
   'A|12:1/2.0',
   'A|5:1/10.0',
   'A|5:1/2.0',
   'A|6:1/12.0',
   'A|6:1/11.0',
   'A|6:1/4.0',
   'A|7:1/9.0',
   'A|7:1/6.0',
   'A|9:1/2.0'],
  1094: ['A|10:1/7.0',
   'A|12:1/2.0',
   'A|5:1/9.0',
   'A|6:1/11.0',
   'A|6:1/4.0',
   'A|7:1/9.0',
   'A|7:1/4.0',
   'A|8:1/4.0',
   'A|8:1/3.0',
   'A|9:1/2.0'],
  1008: ['A|10:1/7.0',
   'A|12:1/2.0',
   'A|5:1/9.0',
   'A|5:1/4.0',
   'A|6:1/11.0',
   'A|6:1/4.0',
   'A|7:1/9.0',
   'A|7:1/9.0',
   'A|8:1/4.0',
   'A|9:1/2.0'],
  1089: ['A|10:1/7.0',
   'A|12:1/2.0',
   'A|5:1/9.0',
   'A|5:1/2.0',
   'A|6:1/11.0',
   'A|6:1/6.0',
   'A|7:1/9.0',
   'A|7:1/3.0',
   'A|8:1/4.0',
   'A|9:1/2.0'],
  1044: ['A|10:1/6.0',
   'A|10:1/6.0',
   'A|5:1/4.0',
   'A|5:1/4.0',
   'A|6:1/10.0',
   'A|6:1/9.0',
   'A|6:1/9.0',
   'A|7:1/8.0',
   'A|7:1/8.0',
   'A|8:1/3.0']}}

Since fuzzywuzzy does not accept lists as input, I need to convert the lists to strings:

df = pd.DataFrame(data)
df['Vector_string'] = df['Vector'].astype(str)

This gives:

unit    Date_job                                                                                                                   Vector                                                                                                                                Vector_string
59    unit1  2021-06-07  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0'] 
662   unit1  2021-06-14  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|5:1/8.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0'] 
680   unit1  2021-07-05  [A|14:1/9.0, A|14:1/4.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]  ['A|14:1/9.0', 'A|14:1/4.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']
725   unit1  2021-07-26  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0'] 
709   unit1  2021-08-30  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0'] 
703   unit1  2021-10-11  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0'] 
653   unit1  2021-10-18  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0'] 
807   unit4  2021-07-19  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|4:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|4:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']   
825   unit4  2021-07-26  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']   
778   unit4  2021-08-23  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0, A|8:1/7.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']   
816   unit4  2021-08-30  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|6:1/5.0, A|6:1/4.0, A|7:1/10.0, A|7:1/10.0, A|8:1/7.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/4.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']   
822   unit4  2021-09-06  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/4.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/4.0', 'A|7:1/10.0', 'A|7:1/10.0']   
849   unit4  2021-09-27  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/3.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']   
820   unit4  2021-10-04  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/5.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/5.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']   
754   unit4  2021-10-18  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/3.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']   
1031  unit3  2021-09-06  [A|10:1/7.0, A|12:1/2.0, A|5:1/10.0, A|5:1/2.0, A|6:1/12.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/6.0, A|9:1/2.0]      ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/10.0', 'A|5:1/2.0', 'A|6:1/12.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/6.0', 'A|9:1/2.0']    
1094  unit2  2021-07-26  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/4.0, A|8:1/4.0, A|8:1/3.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']      
1008  unit2  2021-08-09  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/4.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/9.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']      
1089  unit2  2021-10-04  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/2.0, A|6:1/11.0, A|6:1/6.0, A|7:1/9.0, A|7:1/3.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/2.0', 'A|6:1/11.0', 'A|6:1/6.0', 'A|7:1/9.0', 'A|7:1/3.0', 'A|8:1/4.0', 'A|9:1/2.0']      
1044  unit5  2021-06-14  [A|10:1/6.0, A|10:1/6.0, A|5:1/4.0, A|5:1/4.0, A|6:1/10.0, A|6:1/9.0, A|6:1/9.0, A|7:1/8.0, A|7:1/8.0, A|8:1/3.0]        ['A|10:1/6.0', 'A|10:1/6.0', 'A|5:1/4.0', 'A|5:1/4.0', 'A|6:1/10.0', 'A|6:1/9.0', 'A|6:1/9.0', 'A|7:1/8.0', 'A|7:1/8.0', 'A|8:1/3.0']      
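
For reference, here is a small sketch of what this conversion produces for a single row, next to a plain space-joined form. The join variant is only an assumption about what might suit token-based scorers; it is not what the code above uses, and the vector is shortened for illustration.

```python
# One shortened Vector entry from the data above
vec = ['A|14:1/9.0', 'A|15:1/11.0', 'B|2:1/3.0']

# astype(str) on an object column applies str() to each element,
# so the brackets, quotes and commas become part of the string
as_repr = str(vec)

# Hypothetical alternative: join the items with single spaces
as_joined = " ".join(vec)

print(as_repr)
print(as_joined)
```

With the `str()` form, the list punctuation takes part in every character-level comparison.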

Now, I compare the Vector_string instances with each other (for each unit) as follows:

from fuzzywuzzy import process, fuzz

UNITS = list(set(df.unit.unique()))
fre = []
for unit in UNITS:
    d = df[df['unit']==unit]
    d = d.reset_index()
    if len(d)>1:
        d2 = pd.DataFrame([process.extract(d['Vector_string'][i], d[~d.index.isin([i])]['Vector_string'], limit=1)[0] for i in range(len(d))],
                   index=d.index, columns=['match_Vector', 'match_percent', 'match_index'])
    else:
        0
    final = d.join(d2)
    fre.append(final)
    
dff = pd.concat(fre)

dff = dff.sort_values(['unit','Date_job'])

Which returns:

index   unit    Date_job                                                                                                                   Vector                                                                                                                                Vector_string                                                                                                                                 match_Vector  match_percent  match_index
0  59     unit1  2021-06-07  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   100            3          
1  662    unit1  2021-06-14  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|5:1/8.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|14:1/4.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  95             2          
2  680    unit1  2021-07-05  [A|14:1/9.0, A|14:1/4.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]  ['A|14:1/9.0', 'A|14:1/4.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   95             0          
3  725    unit1  2021-07-26  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   100            0          
4  709    unit1  2021-08-30  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   100            0          
5  703    unit1  2021-10-11  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   100            6          
6  653    unit1  2021-10-18  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   100            5          
0  1094   unit2  2021-07-26  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/4.0, A|8:1/4.0, A|8:1/3.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']        95             1          
1  1008   unit2  2021-08-09  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/4.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/9.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/2.0', 'A|6:1/11.0', 'A|6:1/6.0', 'A|7:1/9.0', 'A|7:1/3.0', 'A|8:1/4.0', 'A|9:1/2.0']        98             2          
2  1089   unit2  2021-10-04  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/2.0, A|6:1/11.0, A|6:1/6.0, A|7:1/9.0, A|7:1/3.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/2.0', 'A|6:1/11.0', 'A|6:1/6.0', 'A|7:1/9.0', 'A|7:1/3.0', 'A|8:1/4.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']        98             1          
0  1031   unit3  2021-09-06  [A|10:1/7.0, A|12:1/2.0, A|5:1/10.0, A|5:1/2.0, A|6:1/12.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/6.0, A|9:1/2.0]      ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/10.0', 'A|5:1/2.0', 'A|6:1/12.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/6.0', 'A|9:1/2.0']      ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     99             1          
0  807    unit4  2021-07-19  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|4:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|4:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     99             1          
1  825    unit4  2021-07-26  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|4:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     99             0          
2  778    unit4  2021-08-23  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0, A|8:1/7.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/4.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']     99             3          
3  816    unit4  2021-08-30  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|6:1/5.0, A|6:1/4.0, A|7:1/10.0, A|7:1/10.0, A|8:1/7.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/4.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0', 'A|8:1/7.0']     99             2          
4  822    unit4  2021-09-06  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/2.0, A|5:1/2.0, A|6:1/5.0, A|6:1/4.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/4.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     99             1          
5  849    unit4  2021-09-27  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/3.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     100            7          
6  820    unit4  2021-10-04  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/5.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/5.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/2.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     99             1          
7  754    unit4  2021-10-18  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6.0, A|5:1/3.0, A|5:1/2.0, A|6:1/5.0, A|6:1/5.0, A|7:1/10.0, A|7:1/10.0]     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A|3:1/6.0', 'A|5:1/3.0', 'A|5:1/2.0', 'A|6:1/5.0', 'A|6:1/5.0', 'A|7:1/10.0', 'A|7:1/10.0']     100            5          
0  1044   unit5  2021-06-14  [A|10:1/6.0, A|10:1/6.0, A|5:1/4.0, A|5:1/4.0, A|6:1/10.0, A|6:1/9.0, A|6:1/9.0, A|7:1/8.0, A|7:1/8.0, A|8:1/3.0]        ['A|10:1/6.0', 'A|10:1/6.0', 'A|5:1/4.0', 'A|5:1/4.0', 'A|6:1/10.0', 'A|6:1/9.0', 'A|6:1/9.0', 'A|7:1/8.0', 'A|7:1/8.0', 'A|8:1/3.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']        95             1                                                                                                                               

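Note that I have created

a) a column giving the match percentage with another string

b) the index of the row holding the matched string.

But this is not really what I want. Instead, I would like the first row of each group to have a 100% match with itself and match_index = 0, and every other string to be compared with that first string.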

Another approach I could accept is the following:

fred = []
for unit in UNITS:
    d = df[df['unit']==unit]
    d = d.reset_index()
   
    score_sort = [(x,) + i
             for x in d['Vector_string'] 
             for i in process.extract(x, d['Vector_string'],scorer=fuzz.token_sort_ratio)]
 
    similarity_sort = pd.DataFrame(score_sort, columns=['Vector_string_r','Matched_vector','match_sort','score_sort'])
   
    final = d.join(similarity_sort)
    

    fred.append(final)
    
dfff = pd.concat(fred)

This gives:

print(dfff.sort_values(['unit','Date_job']).head(10))

index   unit    Date_job                                                                                                                   Vector                                                                                                                                Vector_string                                                                                                                             Vector_string_r                                                                                                                              Matched_vector  match_sort  score_sort
0  59     unit1  2021-06-07  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  100         0         
1  662    unit1  2021-06-14  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|5:1/8.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  100         3         
2  680    unit1  2021-07-05  [A|14:1/9.0, A|14:1/4.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]  ['A|14:1/9.0', 'A|14:1/4.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  100         4         
3  725    unit1  2021-07-26  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']  98          5         
4  709    unit1  2021-08-30  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/3.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/5.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']  98          6         
5  703    unit1  2021-10-11  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0']  100         1         
6  653    unit1  2021-10-18  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/4.0, B|2:1/4.0, B|3:1/12.0, B|4:1/12.0, B|5:1/9.0, B|6:1/6.0, B|7:1/5.0]   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/6.0', 'B|7:1/5.0']   ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|5:1/8.0', 'B|6:1/5.0', 'B|7:1/5.0']  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', 'B|11:1/4.0', 'B|2:1/3.0', 'B|3:1/12.0', 'B|4:1/12.0', 'B|5:1/9.0', 'B|6:1/5.0', 'B|7:1/5.0']  96          0         
0  1094   unit2  2021-07-26  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/4.0, A|8:1/4.0, A|8:1/3.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']       ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']       100         0         
1  1008   unit2  2021-08-09  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/4.0, A|6:1/11.0, A|6:1/4.0, A|7:1/9.0, A|7:1/9.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']       ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/4.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/9.0', 'A|8:1/4.0', 'A|9:1/2.0']       97          1         
2  1089   unit2  2021-10-04  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/2.0, A|6:1/11.0, A|6:1/6.0, A|7:1/9.0, A|7:1/3.0, A|8:1/4.0, A|9:1/2.0]        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/2.0', 'A|6:1/11.0', 'A|6:1/6.0', 'A|7:1/9.0', 'A|7:1/3.0', 'A|8:1/4.0', 'A|9:1/2.0']        ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6:1/11.0', 'A|6:1/4.0', 'A|7:1/9.0', 'A|7:1/4.0', 'A|8:1/4.0', 'A|8:1/3.0', 'A|9:1/2.0']       ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5:1/2.0', 'A|6:1/11.0', 'A|6:1/6.0', 'A|7:1/9.0', 'A|7:1/3.0', 'A|8:1/4.0', 'A|9:1/2.0']       95          2               

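This approach does solve the "compare the first row with itself" issue, but it does not compare each subsequent row with the first row (per unit, of course!).

Any insight is much appreciated.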

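If I understand correctly, you want a similarity measure of each element against the first element, repeated for each unit. One approach:

  • sort by Date_job so that the first row is well defined (not shown)
  • create a new column first_vec that repeats the first Vector_string for each unit
  • compute fuzz.ratio(Vector_string, first_vec) for each row
  • clean up the temporary column first_vec

# First Vector_string of each unit, repeated on every row of that unit
df["first_vec"] = df.groupby("unit").Vector_string.transform('first')
# Similarity of each row's string to its unit's first string
df["score"] = df.apply(lambda x: fuzz.ratio(x.Vector_string, x.first_vec), axis=1)
# Drop the temporary column (columns= is required; the default axis targets row labels)
df.drop(columns="first_vec", inplace=True)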

Output:

       unit    Date_job                                             Vector                                      Vector_string  score
59    unit1  2021-06-07  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...    100
662   unit1  2021-06-14  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...     91
680   unit1  2021-07-05  [A|14:1/9.0, A|14:1/4.0, A|15:1/11.0, A|16:1/1...  ['A|14:1/9.0', 'A|14:1/4.0', 'A|15:1/11.0', 'A...     90
725   unit1  2021-07-26  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...    100
709   unit1  2021-08-30  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...    100
703   unit1  2021-10-11  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...     99
653   unit1  2021-10-18  [A|14:1/9.0, A|15:1/11.0, A|16:1/12.0, B|11:1/...  ['A|14:1/9.0', 'A|15:1/11.0', 'A|16:1/12.0', '...     99
807   unit4  2021-07-19  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...    100
825   unit4  2021-07-26  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     99
778   unit4  2021-08-23  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     90
816   unit4  2021-08-30  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     90
822   unit4  2021-09-06  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     99
849   unit4  2021-09-27  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     99
820   unit4  2021-10-04  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     99
754   unit4  2021-10-18  [A|10:1/13.0, A|10:1/13.0, A|3:1/6.0, A|3:1/6....  ['A|10:1/13.0', 'A|10:1/13.0', 'A|3:1/6.0', 'A...     99
1031  unit3  2021-09-06  [A|10:1/7.0, A|12:1/2.0, A|5:1/10.0, A|5:1/2.0...  ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/10.0', 'A|...    100
1094  unit2  2021-07-26  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|6:1/11.0...  ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|6...    100
1008  unit2  2021-08-09  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/4.0,...  ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5...     89
1089  unit2  2021-10-04  [A|10:1/7.0, A|12:1/2.0, A|5:1/9.0, A|5:1/2.0,...  ['A|10:1/7.0', 'A|12:1/2.0', 'A|5:1/9.0', 'A|5...     89
1044  unit5  2021-06-14  [A|10:1/6.0, A|10:1/6.0, A|5:1/4.0, A|5:1/4.0,...  ['A|10:1/6.0', 'A|10:1/6.0', 'A|5:1/4.0', 'A|5...    100
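
The same idea can be sketched without pandas, which also makes the sorting step explicit. This is only an illustration under assumptions: `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio`, so the scores need not match the table above, and the `rows` list and the helpers `similarity` and `score_against_first` are hypothetical names using shortened vectors.

```python
import datetime
from difflib import SequenceMatcher
from itertools import groupby

# (unit, date, vector) records; a shortened, hypothetical subset of the data
rows = [
    ("unit1", datetime.date(2021, 6, 14), ['A|14:1/9.0', 'B|3:1/12.0', 'B|5:1/8.0']),
    ("unit1", datetime.date(2021, 6, 7), ['A|14:1/9.0', 'B|2:1/3.0', 'B|3:1/12.0']),
    ("unit1", datetime.date(2021, 7, 26), ['A|14:1/9.0', 'B|2:1/3.0', 'B|3:1/12.0']),
    ("unit5", datetime.date(2021, 6, 14), ['A|10:1/6.0', 'A|5:1/4.0', 'A|8:1/3.0']),
]

def similarity(a, b):
    # Stand-in for fuzz.ratio: an integer score from 0 to 100
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def score_against_first(records):
    # Sort by unit, then date, so the first row of each unit is well defined
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    scored = []
    for unit, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        first = str(group[0][2])  # reference string for this unit
        for _, date, vec in group:
            scored.append((unit, date, similarity(str(vec), first)))
    return scored

result = score_against_first(rows)
for unit, date, score in result:
    print(unit, date, score)
```

Each unit's earliest row scores 100 against itself, and a unit with a single row is handled the same way without a special case.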

You can include the index value of the first row using the same tools as above.
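
As a minimal sketch of that last point, assuming the frame is already ordered so that each unit's first row is the reference, the first index label per unit can be broadcast with the same `groupby(...).transform('first')` pattern. The small frame and the column name `first_index` are made up for illustration.

```python
import pandas as pd

# Small stand-in frame reusing a few index labels and units from the question
df = pd.DataFrame(
    {"unit": ["unit1", "unit1", "unit4", "unit4", "unit3"]},
    index=[59, 662, 807, 825, 1031],
)

# Turn the index into a Series, group it by unit, and repeat the
# first index label of each unit on every row of that unit
df["first_index"] = df.index.to_series().groupby(df["unit"]).transform("first")

print(df)
```

The first row of every unit then points at its own index label, which is the reference row the question asks for.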