Efficient way of generating new columns having string similarity distances between two string columns
I have a pandas dataframe of shape (1138812, 14) with the columns
['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
'point_of_interest_2', 'match']
I want to create new columns holding the string similarity distances difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler(), and LongestCommonSubstring() between each of the string columns
['name', 'address', 'city', 'state',
'zip', 'country', 'url', 'phone', 'categories']
and the corresponding _2-suffixed columns. In the end that gives me 9*4 = 36 new columns.
Right now I am looping over the dataframe with df.iterrows() and building a list per column, but it is extremely slow and memory-hungry: processing the whole dataframe takes 3.5 hours and uses the full 16GB of RAM. I am trying to find a better approach in terms of both time and memory.
My code:
```python
import difflib

import Levenshtein
import pandas as pd
from tqdm.notebook import tqdm

columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']

data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []

for i, row in tqdm(train.iterrows(), total=train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))  # LCS: user-defined LongestCommonSubstring helper (not shown)

data = pd.DataFrame(data_dict)
train = pd.concat([train, data], axis=1)  # pd.concat takes a list of frames
```
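The LCS() helper is not shown in the question; a minimal sketch of what it presumably computes, the length of the longest common substring, built on difflib. The name lcs_len and the integer return convention are assumptions:

```python
import difflib

def lcs_len(a: str, b: str) -> int:
    # Hypothetical stand-in for the LCS() call above: length of the
    # longest common substring, via SequenceMatcher.find_longest_match.
    match = difflib.SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size
```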
Starting with a dataframe that looks like this:
first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal) |
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel) |
with shape 1024000 rows × 16 columns:
```python
import difflib

import Levenshtein
import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=8)  # Customize based on # of cores, or leave blank to use all

def dists(x, y):
    matcher = difflib.SequenceMatcher(None, x, y)
    geshs = matcher.ratio()
    levens = Levenshtein.distance(x, y)
    jaros = Levenshtein.jaro_winkler(x, y)
    lcss = matcher.find_longest_match(0, len(x), 0, len(y)).size  # I wasn't sure how you'd done this one; this takes the longest-common-substring length.
    return [geshs, levens, jaros, lcss]

df = pd.read_csv('MOCK_DATA.csv')
df = df.astype(str)  # force all fields to strings
cols = df.columns
cols = np.array_split(cols, 2)  # assumes there's a matching `_2` column for every column

for x, y in zip(*cols):
    res = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
    # Write the four expanded result columns back; assign the raw values so
    # pandas doesn't try to align on the expanded integer column labels.
    df[[x + '_geshs', x + '_levens', x + '_jaros', x + '_lcss']] = res.to_numpy()
    # Replace parallel_apply with apply to run non-parallel.
```
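As a quick spot check, the dists helper can also be called directly on a single pair of strings; the four returned values are the SequenceMatcher ratio, the Levenshtein distance, the Jaro-Winkler similarity, and the longest-common-substring length. The inputs here are made up:

```python
# Hypothetical spot check; the output values depend entirely on the inputs.
print(dists("Dallas", "Lubbock"))
```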
I get the columns below (in addition to keeping the original ones) in 3 minutes; without parallelization it would likely still take only ~20-30 minutes. Python's peak memory usage was only around 3GB, and it is much lower without parallelization.
first_name_geshs | first_name_levens | first_name_jaros | first_name_lcss | address_geshs | address_levens | address_jaros | address_lcss | city_geshs | city_levens | city_jaros | city_lcss | state_geshs | state_levens | state_jaros | state_lcss | zip_geshs | zip_levens | zip_jaros | zip_lcss | url_geshs | url_levens | url_jaros | url_lcss | phone_geshs | phone_levens | phone_jaros | phone_lcss | categories_geshs | categories_levens | categories_jaros | categories_lcss |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
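If adding pandarallel as a dependency is undesirable, iterating plain Python tuples with zip is also usually far cheaper than iterrows or apply. A sketch under the same assumptions as the code above, reusing its dists helper and cols split:

```python
import numpy as np

# Non-parallel alternative: zip over raw column values instead of pandas rows.
for x, y in zip(*cols):
    results = [dists(a, b) for a, b in zip(df[x], df[y])]
    df[[x + '_geshs', x + '_levens', x + '_jaros', x + '_lcss']] = np.array(results)
```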