Replacing for-loops with better alternatives in pandas DataFrames for similarity measurement
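I am writing a function that computes the cosine similarity between each record in one dataset (M x K) and the records in another dataset (N x K), where N is much smaller than M.
The code below works fine when I test it on a very small dataset such as the 'iris' dataset. I am worried that it will struggle on a larger dataset (100,000 records and more than 100 variables).
I know for loops are discouraged in this situation, and here I have two of them. Can anyone suggest a way to improve this code?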
    import pandas as pd
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity_calculation(seed_data, pool_data):
        # Collect one list of similarity scores per pool record
        # (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list)
        similarity_rows = []
        for indexi, rowi in pool_data.iterrows():
            # Create an array to store the similarity score for each record in pool data
            similarity_score_array = []
            for indexj, rowj in seed_data.iterrows():
                # Fetch a single record from pool dataset
                pool = rowi.values.reshape(1, -1)
                # Fetch a single record from seed dataset
                seed = rowj.values.reshape(1, -1)
                # Measure similarity score between the two records
                similarity_score = (cosine_similarity(pool, seed))[0][0]
                similarity_score_array.append(similarity_score)
            # Add the similarity score array as a new record of the similarity matrix
            similarity_rows.append(similarity_score_array)
        # One row per pool record, one column per seed record
        similarity_matrix = pd.DataFrame(similarity_rows)
        return similarity_matrix
Edit 1: The sample data (the iris dataset) is used as follows:
iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
The expected result is:
My new, more compact code (with only one for loop) is as follows:
    def similarity_calculation_compact(seed_data, pool_data):
        Array1 = pool_data.values
        Array2 = seed_data.values
        scores = []
        for i in range(Array1.shape[0]):
            # Mean similarity of pool record i against all seed records
            scores.append(np.mean(cosine_similarity(Array1[None, i, :], Array2)))
        final_data = pool_data.copy()
        final_data['mean_similarity_score'] = scores
        final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
        return final_data
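The output I get is:
I expected the same results, because both functions should pick the records from the pool data that are most similar to the seed data (in terms of mean cosine similarity).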
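No for loop is needed, because cosine_similarity takes two arrays of shape (n_samples_X, n_features) and (n_samples_Y, n_features) as input and returns an array of shape (n_samples_X, n_samples_Y) by computing the cosine similarity between every pair of rows from the two input arrays.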
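To make that shape contract concrete, here is a minimal sketch on two tiny made-up arrays (X and Y are illustrative values, not part of the question's data). Row i of the result holds the similarities of X[i] against every row of Y.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy inputs: 3 samples and 2 samples, both with 2 features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 2.0]])

# One call computes all 3 x 2 pairwise similarities
sim = cosine_similarity(X, Y)
print(sim.shape)  # (3, 2)
print(sim)
```

Because cosine similarity ignores vector length, [0, 1] and [0, 2] count as identical directions, and [1, 1] sits at 45 degrees to both rows of Y.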
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
iris_data = pd.read_csv("iris.csv", header=0)
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
np.mean(cosine_similarity(pool_set, seed_set), axis=1)
Result (after sorting):
array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
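If the goal is the same sorted DataFrame that similarity_calculation_compact returns, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function name rank_pool_by_mean_similarity and the random demo data are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def rank_pool_by_mean_similarity(seed_data, pool_data):
    # Hypothetical helper: (n_pool, n_seed) matrix of pairwise cosine similarities
    sim = cosine_similarity(pool_data.to_numpy(), seed_data.to_numpy())
    ranked = pool_data.copy()
    # Average over the seed axis to get one score per pool record
    ranked["mean_similarity_score"] = sim.mean(axis=1)
    return ranked.sort_values(by="mean_similarity_score", ascending=False)

# Made-up data standing in for the iris features
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((50, 4)), columns=list("abcd"))
seed_demo, pool_demo = demo.iloc[:10], demo.iloc[10:]

ranked_demo = rank_pool_by_mean_similarity(seed_demo, pool_demo)
print(ranked_demo.head())
```

The intermediate matrix has one row per pool record and one column per seed record, so with a small seed set it stays modest: for example, 100,000 pool rows against 10 seed rows is 1,000,000 float64 values, about 8 MB.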