Slow fuzzy matching between two DataFrames
I have DataFrame A (df_cam) with cli id and origin:

cli id | origin
------------------------------------
123    | 1234 M-MKT XYZklm 05/2016

and DataFrame B (df_dict) with shortcut and campaign:

shortcut | campaign
------------------------------------
M-MKT    | Mobile Marketing Outbound

I know that, for example, a client with origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because the origin contains the keyword M-MKT.
Note that the shortcut is a generic keyword that the algorithm is supposed to pick up on; the origin could just as well be M-Marketing, MMKT, or Mob-MKT. I built the list of shortcuts manually by going through all the origins. I also clean the origin with a regular expression before feeding it into the program.
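The cleaning step itself appears in the final solution further down; as a minimal sketch of what it does (clean_origin is just an illustrative wrapper around that same regex):

import re

def clean_origin(origin):
    # keep letters only; this mirrors the regex used in the solution below
    return re.sub('[^A-Za-z]+', '', origin)

clean_origin('1234 M-MKT XYZklm 05/2016')  # -> 'MMKTXYZklm'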
I want to match a client's origin to a campaign via the shortcut, with a score attached so I can review the differences, like this:

cli id | shortcut | origin                    | campaign                  | Score
---------------------------------------------------------------------------------
123    | M-MKT    | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93
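For reference, a single score in that table is one fuzzy comparison between the shortcut and the origin; a minimal sketch (note that fuzzywuzzy scorers such as token_sort_ratio return an integer from 0 to 100 rather than a 0-1 value, so the 0.93 above would presumably correspond to 93):

from fuzzywuzzy import fuzz

# token_sort_ratio returns an integer similarity in the range 0-100
score = fuzz.token_sort_ratio('M-MKT', '1234 M-MKT XYZklm 05/2016')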
Below is my program. It works, but it is really slow: DataFrame A has about 400,000 rows and DataFrame B about 40 rows. Is there a way to make it faster?
from fuzzywuzzy import fuzz

list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """
    Calculates similarity between two tokens and returns TOP match and score
    -----------------------------------------------------------------------
    tokenA: similarity to this token will be calculated
    dict_: list with shortcuts
    position: whether I want the first, second, third... TOP position
    value: 0=similarity score, 1=associated shortcut
    -----------------------------------------------------------------------
    """
    # score tokenA against every shortcut, then sort best-first
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 1), axis=1)
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 0), axis=1)
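Note that the two apply calls above score every row against the full shortcut list twice. As a rough sketch (TopFuzzMatchPair is a hypothetical helper, assuming the same df_cam, list_values, and fuzz import as above), the work can be halved by scoring once per row and unpacking both columns:

def TopFuzzMatchPair(tokenA, dict_):
    # score tokenA against every shortcut once and keep the best (shortcut, score)
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[0][1], sim[0][0]

pairs = [TopFuzzMatchPair(origin, list_values) for origin in df_cam['cli_origin']]
df_cam['1st_choice_short'], df_cam['1st_choice_sim'] = zip(*pairs)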
Note that I also want to compute the second and third best match to evaluate accuracy.
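One way to get the second and third best match in the same pass is process.extract with limit=3, which returns a best-first list of (match, score) tuples; a minimal sketch, assuming the list_values defined above:

from fuzzywuzzy import fuzz, process

# top three (shortcut, score) pairs for one origin string, best first
top3 = process.extract('1234 M-MKT XYZklm 05/2016', list_values,
                       scorer=fuzz.token_sort_ratio, limit=3)
# top3[0] is the best match, top3[1] the second best, top3[2] the third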
EDIT

I found the process.extractOne method, but it is just as slow. My code now looks like this:
from fuzzywuzzy import process

def TopFuzzMatch(token, dict_, value):
    # extractOne returns (best match, score); value picks which one to return
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]
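Applied the same way as before (a sketch, assuming the same df_cam and list_values as above; since extractOne returns the pair (match, score), index 0 is the shortcut and index 1 is the score):

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0), axis=1)
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 1), axis=1)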
I found a solution: after cleaning the origin column with a regex (no digits or special characters), there are only a few hundred distinct values among all the duplicates, so I run the fuzzy matching only on those distinct values, which improves the runtime significantly.
import re
import numpy as np
import pandas as pd
from fuzzywuzzy import process

def TopFuzzMatch(df_cam, df_dict):
    """
    Calculates similarity between two tokens and returns the TOP match
    The idea is to do it only over distinct values in the given DF (takes ages otherwise)
    -----------------------------------------------------------------------
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with the abbreviation that is matched to the description I need
    -----------------------------------------------------------------------
    """
    # Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1)
    # Get unique values and calculate similarity
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]
    # To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match
    # Merge
    df_cam = pd.merge(df_cam, df_match, how='left', left_on='clean_camp', right_on='unique')
    df_cam = pd.merge(df_cam, df_dict, how='left', left_on='top_match', right_on='Shortcut')
    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)
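Since process.extractOne already returns the score together with the match, the same distinct-value trick can keep a Score column for the accuracy check; a sketch of how the top_match lines inside TopFuzzMatch could be rewritten (same imports and variables as in the function above):

# inside TopFuzzMatch, replace the top_match list comprehension with:
best = [process.extractOne(x, df_dict['Shortcut']) for x in uq_origin]
df_match = pd.DataFrame({'unique': uq_origin})
df_match['top_match'] = [b[0] for b in best]
df_match['Score'] = [b[1] for b in best]
# the two merges that follow then carry both top_match and Score to every row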