Best way to remove nested loop with complex logic
I have a program that reads a spreadsheet of properties into a DataFrame, then queries a SQL database to create another DataFrame, and runs a cosine similarity function over the two to determine which addresses in the spreadsheet are in my database.
The code for my cosine similarity function, along with some helper functions, is below. The problem I have is that on sheets containing hundreds or thousands of addresses it is very slow, because it uses nested for loops to build a list of the best similarity for each address.
import string
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def clean_address(text):
    text = ''.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    return text

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
def getCosineSimilarities(internalDataframe, externalDataframe):
    similarities = []
    internalAddressColumn = internalDataframe['Address']
    internalPostcodeColumn = internalDataframe['postcode']
    externalAddressColumn = externalDataframe['full address']
    externalPostcodeColumn = externalDataframe['postcode']
    for i in range(len(internalDataframe)):
        bestSimilarity = 0
        for j in range(len(externalDataframe)):
            if internalPostcodeColumn.iloc[i].rstrip() == externalPostcodeColumn.iloc[j]:
                vector1 = text_to_vector(clean_address(internalAddressColumn.iloc[i]))
                vector2 = text_to_vector(clean_address(externalAddressColumn.iloc[j]))
                cosine = get_cosine(vector1, vector2)
                if cosine > bestSimilarity:
                    bestSimilarity = cosine
        similarities.append(bestSimilarity)
    return similarities
I'm sure it must be possible to build the "similarities" list returned by getCosineSimilarities with a list comprehension or something similar, but I can't work out the best way to do it.
Can anyone help?
Edit:
internalDataframe.head(5)
Name postcode Created
0 Mr Joe Bloggs SW6 6RD 2020-10-21 14:15:58.140
1 Mrs Joanne Bloggs SE17 1LN 2013-06-27 14:52:29.417
2 Mr John Doe SW17 0LN 2017-02-23 16:22:03.630
3 Mrs Joanne Doe SW6 7JX 2019-07-03 14:52:00.773
4 Mr Joe Public W5 2RX 2012-11-19 10:28:47.863
externalDataframe.head(5)
address_id category beds postcode
1005214 FLA 2 NW5 4DA
1009390 FLA 2 NW5 1PB
1053948 FLA 2 NW6 3SJ
1075629 FLA 2 NW6 7UP
1084325 FLA 2 NW6 7YQ
As you say, the problem here is the nested loop. For each item in internalDataframe you are performing quite a few expensive operations against externalDataframe:
- text_to_vector involves a regex findall and a Counter creation. You could memoize the values from externalDataframe and modify your function accordingly.
- get_cosine involves conversions to set and sums of squares over all items in the Counter. Again, you could memoize the values from externalDataframe and modify your function accordingly. In this case you may also want to memoize the results for internalDataframe.
- Less important: for x in list(vec1.keys()) is redundant. You force the conversion of dict_keys into a list (one iteration) and then iterate over that list (another iteration). Just do for x in vec1.keys().
- Even less important: you could check whether one of sum1 or sum2 is zero before computing the product of their square roots, instead of checking whether that product is zero.
It seems like you need something like a distance matrix. Based on the linked answer, here is a sketch of how to compare all pairs of strings from two dataframe columns:
import pandas as pd
import numpy as np
from collections import Counter
import math

def text2vec(text):
    # just a naive transformation
    return Counter(text.split())

def get_cosine(text1, text2):
    """Modified version of your function – you might want to improve
    it some more following gimix's advice or even better, make
    full use of numpy arrays"""
    vec1, vec2 = text2vec(text1), text2vec(text2)
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

# Makes your function vector-ready
cos_sim = np.vectorize(get_cosine)

# Some pseudo data
data = {"address": ["An address in some city",
                    "Cool location in some town",
                    "100 places to see before you die"]}
data2 = {"address": ["Disney world",
                     "An address in some city",
                     "500 places to see before you die",
                     "Neat location in some town"]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Compare all combinations and combine to a new dataframe
# This is 1:1 adopted from the answer linked above
cos_matrix = cos_sim(df.address.values, df2.address.values[:, None])
result_df = pd.concat((df2.address,
                       pd.DataFrame(cos_matrix,
                                    columns=df.address)),
                      axis=1)
print(result_df)
This gives you all the values, and you can then get the best one with max:
address An address in some... Cool location in... 100 places to see...
0 Disney world 0.0 0.0 0.000000
1 An address in some city 1.0 0.4 0.000000
2 500 places to see before you die 0.0 0.0 0.857143
3 Neat location in some town 0.4 0.8 0.000000
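To turn a matrix like the one above into a single best match per external address, pandas max and idxmax along axis=1 work directly. Sketched here on a small hand-built matrix rather than the real cos_matrix:

```python
import pandas as pd

# Toy similarity matrix shaped like result_df above:
# one row per external address, one column per internal address
result_df = pd.DataFrame({
    "address": ["Disney world", "An address in some city"],
    "An address in some city": [0.0, 1.0],
    "Cool location in some town": [0.0, 0.4],
})

scores = result_df.set_index("address")
best_score = scores.max(axis=1)     # highest similarity for each external address
best_match = scores.idxmax(axis=1)  # internal address achieving that score
```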
@SNygard deserves the credit here, as his comment set me off in the right direction (others will need to say whether the other two answers help – I went this way and didn't look back).
I created a column in internalDataFrame to keep the index available as a usable value, created a dictionary to store the best similarity for each index (starting at all 0s), then merged the two DataFrames as suggested. That means I only have to iterate over the merged DataFrame once, updating the relevant similarity in the dictionary.
It cut the time to process the similarities from about 15 seconds to under 0.5 seconds with an externalDataFrame of 500 addresses. I also ran it in 4.5 seconds against an externalDataFrame of 6,000 addresses, which I can't compare to anything because with the previous version it would literally have taken hours to process!
def getCosineSimilarities(internalDataframe, externalDataframe):
    internalDataframe['index'] = internalDataframe.index
    combinedDf = pd.merge(internalDataframe, externalDataframe, on='postcode')
    similarities_dict = dict()
    for i in range(len(internalDataframe)):
        index = internalDataframe['index'].iloc[i]
        similarities_dict[index] = 0
    for i in range(len(combinedDf)):
        vector1 = text_to_vector(clean_address(combinedDf['Address'].iloc[i]))
        vector2 = text_to_vector(clean_address(combinedDf['full address'].iloc[i]))
        cosine = get_cosine(vector1, vector2)
        index = combinedDf['index'].iloc[i]
        if cosine > similarities_dict[index]:
            similarities_dict[index] = cosine
    similarities = []
    for key, value in similarities_dict.items():
        similarities.append(value)
    return similarities
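The same merge-on-postcode idea can also be condensed with a groupby in place of the manual dictionary. This is a hedged sketch with made-up addresses, not the author's exact code, and it uses a simplified whitespace-split cosine rather than the clean_address/text_to_vector pipeline above:

```python
import pandas as pd
from collections import Counter
import math

def cosine(text1, text2):
    # Simplified cosine similarity over whitespace-split tokens
    v1, v2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    num = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    den = (math.sqrt(sum(c * c for c in v1.values()))
           * math.sqrt(sum(c * c for c in v2.values())))
    return num / den if den else 0.0

internal = pd.DataFrame({"Address": ["10 Downing St", "1 Main Rd"],
                         "postcode": ["SW1A 2AA", "AB1 2CD"]})
external = pd.DataFrame({"full address": ["10 Downing St London", "2 Main Rd"],
                         "postcode": ["SW1A 2AA", "AB1 2CD"]})

internal["index"] = internal.index
combined = internal.merge(external, on="postcode")
combined["cosine"] = [cosine(a, b) for a, b in
                      zip(combined["Address"], combined["full address"])]
# Best similarity per internal row; reindex restores rows whose
# postcode had no match, filling them with 0
best = combined.groupby("index")["cosine"].max().reindex(internal.index,
                                                         fill_value=0)
similarities = best.tolist()
```

The reindex with fill_value=0 plays the role of the dictionary initialization in the original: internal rows whose postcode never appears in externalDataframe still come back with a similarity of 0.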