
Speeding up a comparison function for comparing sentences

I have a dataframe with shape (789174, 9). It has a column named resolution that contains sentences of less than 139 characters. I built a function using the difflib library to find sentences whose similarity score is above 0.9. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for over 2 hours and it has still not finished processing i = 1000. I am worried it will take far too long to complete, and I would like to know whether there is a way to speed it up.

import difflib
import time

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Obviously, since we iterate over the column twice, this is O(n^2). I am not sure whether there is a way to make it faster. Any suggestions would be greatly appreciated.
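As an aside on the O(n^2) loop above, difflib's SequenceMatcher also exposes real_quick_ratio() and quick_ratio(), which are documented upper bounds on ratio() and are much cheaper to compute. A sketch of using them to skip clearly dissimilar pairs (this prunes work inside the loop but does not change its quadratic shape):

import difflib

def is_similar(a, b, threshold=0.9):
    sm = difflib.SequenceMatcher(None, a, b)
    # both are upper bounds on ratio(), so if either is already below the
    # threshold the expensive ratio() call can be skipped entirely
    if sm.real_quick_ratio() < threshold or sm.quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold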

Edit:

I tried using difflib and fuzzywuzzy to speed things up. The function below only goes through the column once, but I do iterate over the dictionary keys.

import difflib
from fuzzywuzzy import fuzz

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters
mappings = cluster_resolution(df_sample)

Can this latter function be sped up?

Here is an example of some of the data in the dataframe:

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}

df = pd.DataFrame(data=d)

How I define similarity:

Similarity is really defined by the overall action taken, for example in 'replaced scanner' and 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'. The overall action of the longer string is that the scanner was replaced, so these two are very similar. That is why I chose to use the partial_ratio function, since those two score 100.
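To make that concrete, here is a small sketch comparing partial_ratio with a plain difflib ratio on those two strings (exact scores depend on the fuzzywuzzy version; the 100 figure is as reported above):

import difflib
from fuzzywuzzy import fuzz

short_text = 'replaced scanner'
long_text = ('replaced the scanner for the user with a properly working one from the '
             'cage replaced the wire on the damaged one and stored it for later use')

# partial_ratio aligns the shorter string against the best-matching part of the
# longer one; the question reports a score of 100 for this pair
print(fuzz.partial_ratio(short_text, long_text))

# a plain ratio() over the full strings is much lower because of the length difference
print(difflib.SequenceMatcher(None, short_text, long_text).ratio())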

Note:

Please refer to the second function, cluster_resolution, as that is the function I would like to speed up. The first function will not be useful.

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1, len(input_list)):
            # only run the expensive ratio() when the lengths are close
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Even though this is probably not a practical solution, since if each iteration takes 0.1 s it would still take roughly 90 years, it is at least a more optimized version.
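For completeness, and guessing at how the mapping is meant to be used (this part is not shown above; resolution_mapped is just an illustrative column name), applying it back to the dataframe would look roughly like:

# build the mapping from the unique resolutions only, then map it back;
# working on unique() values also shrinks n before the quadratic part
unique_resolutions = df['resolution'].unique().tolist()
mapping = generate_mapping(unique_resolutions)
df['resolution_mapped'] = df['resolution'].map(mapping)

Deduplicating first is the same idea your cluster_resolution edit already uses with .unique().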

Regarding your last edit, I would make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz):

import difflib
from fuzzywuzzy import fuzz, process
def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():        
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            bests = process.extractBests(
                    string, 
                    set(clusters.keys())-{string},
                    scorer=fuzz.partial_ratio,
                    score_cutoff=80,
                    limit=1
                    )
            
            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [ string ]

    return clusters

But I think you could look further into other solutions, such as CountVectorizer and whatever metric works well there. It is a way to gain speed (since it is vectorized), although the results may not be perfect. Note that CountVectorizer could be a good solution for you, since you already made the choice of partial_ratio.

For example, something like this:

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan

df = pd.DataFrame(d)  # d is the sample dict from the question

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=cv.get_feature_names(),
        index=df['resolution'])

# keep only the words (columns) whose total count is greater than 2
transformed = transformed[transformed.columns[transformed.sum()>2]]

# compute the distance matrix: hamming gives the fraction of differing columns,
# so multiplying by the number of columns yields a count of differing words
d = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))

I think this can still be improved (this is my first attempt at text clustering...). You could also pass your own stop-word list to CountVectorizer (sketched below), which would be one way of helping the algorithm. At the very least, it could let you pre-cluster your dataset before using your previous function, for example:

df.groupby('labels').apply(cluster_resolution)

(That way, if your first clustering is roughly right, you would only check each value against all the other values of its cluster, rather than against all values.)
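Regarding the custom stop-word list mentioned above, a minimal sketch (the extra words are only an illustration picked from the sample sentences; tune them to your logs):

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# extend sklearn's built-in English stop words with domain words that carry
# little signal for clustering these maintenance logs
custom_stop_words = list(text.ENGLISH_STOP_WORDS) + ['user', 'station', 'properly']

cv = CountVectorizer(stop_words=custom_stop_words)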

Thanks to @anon01 for the computation of the distance matrix in , which seems to give slightly better results than hdbscan's default.
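For reference, "hdbscan's default" here means fitting directly on the count matrix with the default euclidean metric instead of the precomputed hamming-based matrix, i.e. something like:

# default behaviour, for comparison with the precomputed distance matrix
default_clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
default_labels = default_clusterer.fit_predict(transformed)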

Edit:

Another attempt, including:

  • a change of metric,
  • an added step using a TF-IDF model,
  • and an added step to lemmatize the words using the nltk package.

So the code would be:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

# nltk data required: 'punkt' (word_tokenize), 'averaged_perceptron_tagger'
# (pos_tag) and 'wordnet' (the lemmatizer), e.g. via nltk.download('wordnet')

d = {...}  # the sample 'resolution' dict from the question
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    
    tag_dict = {
                "J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,
                }

    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence) 
    
    # POS-tag each token
    tagged = nltk.pos_tag(wordsList)   
    
    # Convert the list of (token, tag) to lemmatized tokens
    lems = [
            lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN) )
            for token, tag
            in tagged
            ]

    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)

corpus = df['lemmatized']
pipe = Pipeline(
        [
                ('cv', CountVectorizer(stop_words="english")),
                ('tfid', TfidfTransformer())
         ]).fit(corpus)

transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=pipe.named_steps['cv'].get_feature_names(),
        index=df['resolution'])

d = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))

You could also add some code specific to your case, since your examples seem to concern very specific maintenance logs.

For example, you could add new features to the transformed dataframe based on small lists of hardware/software terms:

import numpy as np

# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

This step could help you get better results, but it may not be necessary. I am not sure how it compares with the FuzzyWuzzy approach for matching strings, but I would be interested in your feedback!