
Efficient way of generating new columns having string similarity distances between two string columns

I have a pandas DataFrame of shape (1138812, 14) with the columns

['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
       'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
       'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
       'point_of_interest_2', 'match']

I want to create new columns holding string similarity distances, namely difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler() and LongestCommonSubstring(), between each of the string columns

['name', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories']

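and the corresponding _2-suffixed columns. In the end this gives me 9*4 = 36 new columns. Right now I loop over the DataFrame with df.iterrows() and build one list per new column, but this is very expensive in both time and memory: it takes 3.5 hours and uses the full 16 GB of RAM to get through the whole DataFrame. I am looking for a better way, in terms of time and memory, to get my result. My code: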

import Levenshtein
import difflib
import pandas as pd
from tqdm.notebook import tqdm
columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']
data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []
for i,row in tqdm(train.iterrows(),total = train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))
data = pd.DataFrame(data_dict)
# pd.concat takes a list of objects as its first argument.
train = pd.concat([train, data], axis = 1)
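The `LCS` helper called in the loop is not shown in the question. As a minimal sketch, assuming it is meant to return the length of the longest common contiguous substring, it could be written with the standard library alone. The name `lcs_length` is hypothetical, and the asker's own function may instead return the substring itself.

```python
import difflib


def lcs_length(a, b):
    """Length of the longest common contiguous substring of a and b.

    Hypothetical stand-in for the question's LCS helper, which is not shown.
    """
    # autojunk=False disables the "popular element" heuristic, which could
    # otherwise hide matches in strings of 200 characters or more.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


if __name__ == "__main__":
    print(lcs_length("Granite Surfaces", "Roofing (Metal)"))
```

`find_longest_match` returns a `Match(a, b, size)` named tuple, so the substring itself is available as `a[match.a:match.a + match.size]` if that is what is needed.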


first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal)
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel)

The shape is 1024000 rows × 16 columns.

import difflib
import Levenshtein
import numpy as np
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # Customize based on # of cores, or leave blank to use all

def dists(x, y):
    matcher = difflib.SequenceMatcher(None, x, y)
    geshs = matcher.ratio()
    levens = Levenshtein.distance(x, y)
    jaros = Levenshtein.jaro_winkler(x, y)
    lcss = matcher.find_longest_match(0, len(x), 0, len(y)) # Returns a Match(a, b, size). I wasn't sure how you'd done this one.
    return [geshs, levens, jaros, lcss]

df = pd.read_csv('MOCK_DATA.csv')
df = df.astype(str) # force all fields to strings.

cols = df.columns
cols = np.array_split(cols, 2) # assumes there's a matching `_2` column for every column.
for x, y in zip(*cols):
    (df[x + '_geshs'], 
     df[x + '_levens'], 
     df[x + '_jaros'], 
     df[x + '_lcss']) = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
    # Replace parallel_apply with apply to run non-parallel.

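(In addition to keeping the original columns) I get these columns in 3 minutes. Without parallelization it would probably still take only ~20-30 minutes. Python's peak memory usage was only around 3 GB, and it would be much lower without parallelization.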

first_name_geshs first_name_levens first_name_jaros first_name_lcss address_geshs address_levens address_jaros address_lcss city_geshs city_levens city_jaros city_lcss state_geshs state_levens state_jaros state_lcss zip_geshs zip_levens zip_jaros zip_lcss url_geshs url_levens url_jaros url_lcss phone_geshs phone_levens phone_jaros phone_lcss categories_geshs categories_levens categories_jaros categories_lcss
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
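The sample rows above show the constants 0 1 2 3 in every group of four columns. That is what tuple-unpacking a DataFrame produces: iterating a DataFrame yields its column labels, and with result_type='expand' those labels are 0 to 3, so each new column is assigned a scalar label rather than the computed values. A minimal sketch of the difference, assuming plain `apply` and only `difflib` so that it runs without `pandarallel` or `Levenshtein`. The frame `demo` and the helper `two_scores` are hypothetical.

```python
import difflib

import pandas as pd


def two_scores(x, y):
    """Hypothetical helper: similarity ratio and longest-common-substring length."""
    matcher = difflib.SequenceMatcher(None, x, y)
    match = matcher.find_longest_match(0, len(x), 0, len(y))
    return [matcher.ratio(), match.size]


demo = pd.DataFrame({"name": ["abcd", "hello"], "name_2": ["abxd", "hello"]})

# One row of results per input row, expanded into columns labelled 0 and 1.
scores = demo.apply(lambda z: two_scores(z["name"], z["name_2"]),
                    axis=1, result_type="expand")

# Iterating a DataFrame yields its column labels, so unpacking gives the
# labels 0 and 1, not the two result columns.
first, second = scores

# Assigning to a list of column names stores the computed values instead.
demo[["name_geshs", "name_lcss"]] = scores
```

The same list-of-column-names assignment should apply to the four-column case, with `parallel_apply` in place of `apply`.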