Further speeding up a parallelization process

I created a dictionary for the labels:

labels = {
    0:['replaced scanner', 'scanner has been replaced', 'replaced the scanner', 'scanner was replaced', 'replaced scanner and tested', 'i replaced the scanner', 'deployed replacement scanner', 'replaced scanner with a new one', 'replaced scanner with', 'replaced damaged scanner', 'replaced scanner and synced to station', 'replaced scanner with asset', 'scanner replaced', 'replaced missing scanner', 'replaced scanner at station', 'replaced broken scanner', 'replaced defective scanner', 'replaced scanner for this station we are glad that we were able to assist you today ill go ahead and mark this ticket as resolved if this issue requires further attention from please let me know have a great day', 'replaced defective scanner with new'],
    1:['station has been rebooted', 'rebooted station', 'performed a remote reboot of the ar station', 'rebooted station remotely', 'remotely reset station', 'restarted station', 'station has been rebooted resolving ticket', 'reboot station verified station is fully operational', 'station rebooted verified that the station is back online', 'rebooted station operational', 'remotely rebooted station', 'reboot of station', 'station has been remotely rebooted', 'station rebooted', 'rebooted the station', 'station was rebooted', 'remotely rebooted the station', 'station was successfully rebooted resolving ticket', 'station has been rebooted verified that the station is back online', 'rebooted the station and all is well', 'station rebooted remotely', 'ar station remotely rebooted', 'rebooted station  issue resolved', 'station rebooted and verified up', 'after rebooting station works good', 'rebooted station issue resolved', 'station rebooted successfully no further issues reported', 'station remotely rebooted', 'successfully rebooted the station as requested confirmed the aa was able to log back into the station and get to work', 'rebooted station and tested', 'reboot station', 'sshd into station and reboot it then used sasd to verify that station came back up successfully', 'station rebooted verified station functionality issue resolved'],
    2:['password reset', 'password has been reset', 'reset password', 'reset users password', 'password rotated', 'peap password rotated', 'the password has been changed', 'reset password for associate', 'i assisted the user with a password reset', 'reset password resolving this tt', 'assisted with password reset at it hub kiosk', 'password successfully reset', 'password rotated on the peap portal', 'password rotation is completed', 'password resetunlock was performed after validating user identity  on resolution please refer over to the correspondence tab', 'successfully reset password', 'password reset successfully', 'reset user password', 'verified user assisted requester with password reset', 'password was reset', 'password changed', 'assisted with password reset', 'reset password for user', 'password reseted', 'password reset done', 'reset the pwd using the password tool hence resolving this tt', 'the password was updated', 'helped aa to reset their password via password tool with admin rights', 'assisted associate with password reset', 'sopno password has been reset', 'reset password for aa', 'password has been reset successfully', 'performed an inperson password reset via password tool', 'password reset for user'],
    3:['replaced printer', 'replaced the printer', 'printer replaced', 'printer has been replaced', 'printer was replaced', 'replaced printer and tested'],
    4:['rebooted thin client', 'tc has been remotely rebooted', 'rebooted the thin client', 'rebooted tc', 'rebooted thinclient', 'tc have been rebooted functionality has been reestablished'],
    5:['printer reconfigured', 'reconfigured printer with zebra tool successfully for ib destinations', 'reconfigured printer', 'printer configured', 'recalibrated printer', 'reconfigured the printer', 'recalibrated the printer', 'calibrated printer', 'configured printertested its working now', 'printer reconfigured its working now', 'printer has been reconfigured', 'configured and tested printer', 'printer has been configured', 'configured printer', 'pushed correct configuration to printer verified everything works resolving'],
    6:['laptop returned to it', 'laptop has been returned', 'laptop returned', 'the laptop has been returned', 'loaner laptop has been returned', 'loaned the user a laptop and made sure it returned to it', 'laptop returned closing', 'laptop replaced', 'laptop was returned', 'loaner laptop received from user', 'laptop returned closing the ticket'],
    7:['replaced keyboard', 'keyboard replaced', 'keyboard has been replaced', 'replaced the keyboard', 'keyboard was replaced', 'replaced defective keyboard'],
    8:['replaced scanner cable', 'scanner cable replaced', 'replaced the scanner cable', 'scanner cable has been replaced'],
    9:['replaced thin client', 'thin client replaced', 'replaced the thin client', 'thinclient replaced', 'tc replaced'],
    10:['scanner reconfigured', 'reconfigured scanner', 'scanner was reconfigured', 'reconfigured the scanner', 'scanner configured'],
    11:['replaced monitor', 'monitor replaced'],
    12:['reinstalled printer and drivers'],
    13:['replaced mouse', 'mouse replaced', 'mouse has been replaced', 'mouse was replaced', 'replaced the mouse', 'replaced defective mouse'],
    14:['stopstart spooler reconfigured printer up and running'],
    15:['restarted thin client', 'thin client rebooted', 'restarted the thin client', 'reboot thin client', 'rebooted the thinclient', 'performed hard reboot of thin client', 'thinclient rebooted'],
    16:['deployed scanner to station', 'deployed scanner', 'scanner deployed', 'deployed new scanner', 'deployed a scanner', 'scanners deployed', 'deployed new scanner to station'],
    17:['pslip cable was unplugged reconnected pslip cable then ran test print to verify that issue is resolved'],
    18:['cable replaced', 'replaced cable', 'cable has been replaced', 'replaced the cable'],
    19:['replaced battery', 'replaced the battery', 'battery replaced'],
    20:['unlocked account'],
    21:['reimaged laptop', 'reimaged the laptop', 'laptop reimaged'],
    22:['rollbacked mcm root cause should be found in master tt'],
    23:['reassigned ports and tested', 'issue resolved printer port reassigned', 'printer port was reassigned', 'reassigned printer ports verified slim', 'reassigned printer port'],
    24:['replaced laptop'],
    25:['resynced scanner to base'],
    26:['camera removed from proxemics', 'removed cameras', 'cameras removed as requested', 'cameras removed from proxemics', 'camera removed as requested', 'camera has been removed from proxemics', 'cameras have been removed from proxemics'],
    27:['the account has been unlocked'],
    28:['replaced screen','screen replaced'],
    29:['confirmed images are uploading correctly to'],
    30:['replaced laptop screen'],
    31:['replaced print head'],
    32:['monitor has been replaced','replaced the monitor'],
    33:['rebooted server per cm resolving to see if any alerts refire'],
    34:['replaced usb cable','usb cable replaced'],
    35:['reconnected usb cable','usb cable was disconnected connected back tested working ok now resolving'],
    36:['replaced power cable'],
    37:['wifi card replaced'],
    38:['reassigned printer ports','printer port reassigned issue resolved','reassigned printer ports slim updated'],
    39:['advised to reach out to global it'],
    40:['replaced pslip printer'],
    41:['moved cameras to server'],
    42:['restarted print spooler'],
    43:['replaced hand scanner'],
    44:['resynced scanner to station'],
    45:['upgraded ios version on switch verified all connections to uplinks are restored'],
    46:['printer installed', 'printer deployed'],
    47:['xterm replaced']
}

Then I built a function that iterates over every string in a pandas column to find the associated bucket (key) it should belong to:

from fuzzywuzzy import fuzz
def cluster_resolution(df, cluster_no, cluster_list):
    for res_string in df['resolution'].unique():
        a = set()
        for val in cluster_list:
            if fuzz.partial_ratio(res_string, val) >= 90:
                a.add(res_string)
        cluster_list.extend(a)
    return {cluster_no:cluster_list}

Then I run a parallelized job:

from joblib import Parallel, delayed

l = Parallel(n_jobs=-2)(
    delayed(cluster_resolution)(df=df_sample,
                                cluster_no=cluster_no,
                                cluster_list=cluster_list)
    for cluster_no, cluster_list in labels.items())

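Running the code above on a 1000-row subsample of my dataframe takes about 5 minutes without parallelization and about 40 seconds with it. I have to run it on the entire dataframe, which has shape (1098118, 9). I would like to know whether there is a way to speed this up other than simply using a more powerful machine. Any suggestions or recommendations are greatly appreciated.

Here is a sample of the dataframe: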

import pandas as pd

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}

df_sample = pd.DataFrame(data=d)

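A few things here.

  1. I think it is redundant to use fuzzy matching and also search for the same string in different word orders. For example, there is no need to search for both 'replaced printer' and 'printer replaced'. You can account for this in fuzzywuzzy by choosing a scorer that ignores word order, such as fuzz.token_set_ratio.
  2. I think the dictionary is a good idea, but I think you should reverse the keys and values, so that the phrase to match is the key and the int is the value.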
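
As a rough sketch of both points, the phrase lists can be collapsed so that word-order variants share one entry, and the dictionary inverted so that each phrase maps to its label. The helper names (normalize, invert_labels) and the two-label subset are illustrative and not part of the original code; sorting the words is a plain-Python stand-in for what a token-sorting scorer does before it compares two strings.

```python
def normalize(phrase):
    """Sort the words so that word-order variants become the same string."""
    return " ".join(sorted(phrase.split()))


def invert_labels(labels):
    """Map each normalized phrase to its label; the first label seen wins."""
    phrase_to_label = {}
    for label, phrases in labels.items():
        for phrase in phrases:
            phrase_to_label.setdefault(normalize(phrase), label)
    return phrase_to_label


# Small illustrative subset of the labels dictionary from the question.
labels_subset = {
    3: ['replaced printer', 'printer replaced', 'replaced the printer'],
    11: ['replaced monitor', 'monitor replaced'],
}

phrase_to_label = invert_labels(labels_subset)
print(phrase_to_label)
```

Here 'replaced printer' and 'printer replaced' collapse into a single key, so five phrases become three entries. Unlike keeping only the first phrase of each list, this retains every distinct wording, which may or may not be what you want depending on how noisy the phrase lists are.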

I don't think parallelization is necessary. Try something like this, and let me know if it takes too long.

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy.process import extractOne

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}
df = pd.DataFrame(data=d)

#reverse the key and val of labels dict, and only select the first item from the val list.
labels2 = {val[0]:key for key, val in labels.items()}
#create the list of keys outside of the apply function so we are not creating a new list with every row
label_keys = labels2.keys()
df['resolution_key']=df.resolution.apply(lambda res: labels2[
    extractOne(res,label_keys,
               scorer=fuzz.token_set_ratio)[0]])

Output:

>>> df
                                           resolution  resolution_key
0                                    replaced scanner               0
1   replaced the scanner for the user with a prope...               0
2                                          tc reimage              21
3                                          updated pc              18
4                       deploying replacement scanner              43
5                       upgraded and rebooted station               1
6                       printer has been reconfigured               5
7     cleared linux print queue and now it is working              12
8     user reset her password successfully closing tt               2
9     have reset the printer to get it to print again               3
10    i plugged usb cable into port and scanner works               8
11    reconfigured hand scanner and linked to station              10
12    replaced the scanner with station is functional               0
13    laptops battery needed to be reset asset serial              19
14  reconfigured scanner confirmed that it scans a...              10
15  reimaging laptop corrected the anyconnect soft...              21
16  printer was unplugged from usb port working pr...               3
17  reconnected usb cable and reassign printer por...              35
18  reconfigured scanner to base and tested with a...              10
19  replaced the defective device with a fresh ima...              24
20  reconfigured the printer and the media to prin...               5
21  tested printer at station connected and workin...               3
22  red scanner reconfigured and base rebooted via...              10
23  station scanner was synced to base and station...              16
24  printer offlineswitched usb portprinter is now...               3
25  replaced the barcode label with one reflecting...              13
26  restarted the thin client by using ssh to run ...              15
27  printer reconfigured and test they are functio...               5
28  removed old printer for service installed repl...              46
29  tc required reboot rebooted tc had aa signin d...               4
30  resetting the printer to factory settings and ...               3
31  updated windows os forced update and the lapto...              21
32  printer settings are set correct and printer i...               3
33  power to printer was disconnected reconnected ...              35
34  power cycled equipment and restocked spooler w...              14
35  laptop checked ive logged into paskiplacowepl ...              21
36  reseated scanner cables connection into usb po...               0
37  the scanner has been replaced and the station ...               0
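
If the full dataframe contains many repeated resolution strings (the question's use of .unique() suggests it might), one further option is to score each distinct string only once and look the result up per row. The sketch below illustrates that idea in plain Python; best_label uses the standard library's difflib purely as a stand-in matcher and is not fuzzywuzzy, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def best_label(text, phrase_to_label):
    """Stand-in matcher (stdlib difflib, not fuzzywuzzy):
    return the label of the most similar phrase."""
    best = max(phrase_to_label,
               key=lambda p: SequenceMatcher(None, text, p).ratio())
    return phrase_to_label[best]


def label_distinct(values, phrase_to_label, matcher=best_label):
    """Run the matcher once per distinct string, then map back to every row."""
    lookup = {v: matcher(v, phrase_to_label) for v in set(values)}
    return [lookup[v] for v in values], lookup


# Illustrative data: three rows, but only two distinct strings.
phrases = {'replaced scanner': 0, 'rebooted station': 1}
rows = ['replaced scanner', 'rebooted station', 'replaced scanner']

row_labels, lookup = label_distinct(rows, phrases)
print(row_labels, len(lookup))
```

With pandas, the same lookup dictionary could be applied with df['resolution'].map(lookup), so the expensive matching runs once per distinct string rather than once per row. How much this helps depends entirely on how many duplicates the real data contains.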

I would suggest using rapidfuzz instead of fuzzywuzzy. It is implemented in C++ and includes some other improvements that allow for significant speedups.

Simply replace from fuzzywuzzy import fuzz with from rapidfuzz import fuzz.