Function to find cosine similarity between rows using only non-null common columns
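I want to write a function that finds the cosine similarity between an index row (the query) and every other row in a DataFrame, using only their common columns. The problem I am facing is that the set of common non-null columns can differ from one pair of rows to the next. I tried replacing the nulls with 0, as was suggested when I asked a similar question earlier, but that is not the output or approach I am looking for, so I am trying to be more specific here. For example, my query looks like this: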
     A    B    C  D    E    F
1    3  NaN    2  1  NaN    4
This row is contained in the similar_rows DataFrame:
     A    B    C  D    E    F
0    2    3  NaN  3    1  NaN
1    3  NaN    2  1  NaN    4
2  NaN    4    1  3  NaN    5
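So the cosine similarity should be found between the query (index 1 in this case) and each of rows 0 and 2 separately, using only the columns that are non-null in both rows. For example, the cosine similarity between rows 0 and 1 should be found using only columns A and D, since those are the only columns that are non-null in both.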
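To make the expected numbers concrete, here is a small standard-library sketch that works through the example by hand. The helper names `common_columns` and `cosine_on_common` are made up for illustration, and the data is just the three example rows above.

```python
import math

nan = math.nan
columns = ["A", "B", "C", "D", "E", "F"]
rows = {
    0: [2, 3, nan, 3, 1, nan],
    1: [3, nan, 2, 1, nan, 4],  # the query
    2: [nan, 4, 1, 3, nan, 5],
}

def common_columns(u, v):
    """Names of the columns that are non-null in both rows."""
    return [c for c, x, y in zip(columns, u, v)
            if not (math.isnan(x) or math.isnan(y))]

def cosine_on_common(u, v):
    """Cosine similarity computed over the shared non-null positions only."""
    pairs = [(x, y) for x, y in zip(u, v)
             if not (math.isnan(x) or math.isnan(y))]
    dot = sum(x * y for x, y in pairs)
    norm_u = math.sqrt(sum(x * x for x, _ in pairs))
    norm_v = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_u * norm_v)

query = rows[1]
for idx in (0, 2):
    print(idx, common_columns(rows[idx], query),
          round(cosine_on_common(rows[idx], query), 4))
# 0 ['A', 'D'] 0.7894
# 2 ['C', 'D', 'F'] 0.9221
```

Row 0 shares A and D with the query, giving 9 / sqrt(130); row 2 shares C, D and F, giving 25 / sqrt(735). The sketch does not guard against rows with no shared columns or an all-zero overlap, which would divide by zero.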
So far, my function looks like this:
def sims(index):
    # find_similar_rows finds all times within a minutes threshold of the index row;
    # not necessary to know for this question, but it gives some context
    similar_rows = find_similar_rows(index)
    # the query row
    query_cols = similar_rows.loc[index]
    # names of the query's non-null columns, as a list
    q_cols_names = query_cols[query_cols.notnull()]
    q_cols_names = list(q_cols_names.index)
    # put the query into its own DataFrame
    qs = pd.DataFrame(query_cols[q_cols_names])
    qs = qs.T
    # this is where the error occurs; I am not sure why
    result = similar_rows[q_cols_names].apply(lambda row: cosine(row, qs))
    return result
# The error says: 'shapes (33,) and (24,) not aligned: 33 (dim 0) != 24 (dim 0)'
# (obviously my actual DataFrame is different from the one above).
# I am not sure what this error is telling me.
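One plausible reading of that error, offered as a guess and not a confirmed diagnosis: `DataFrame.apply` defaults to `axis=0`, so the lambda is called once per column and not once per row. Each column would then have one entry per row of the frame (33 in the real data), while the query has one entry per non-null column (24), and the two lengths cannot be multiplied together. The sketch below uses the small example frame, not the real data, to show what `apply` passes in each mode.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, 3, np.nan],
    "B": [3, np.nan, 4],
    "C": [np.nan, 2, 1],
    "D": [3, 1, 3],
    "E": [1, np.nan, np.nan],
    "F": [np.nan, 4, 5],
})

# Default axis=0: the function receives each column (length = number of rows).
lengths_axis0 = df.apply(lambda s: len(s))

# axis=1: the function receives each row (length = number of columns).
lengths_axis1 = df.apply(lambda s: len(s), axis=1)

print(lengths_axis0.tolist())  # [3, 3, 3, 3, 3, 3]
print(lengths_axis1.tolist())  # [6, 6, 6]
```

If this is the cause, passing `axis=1` would make the lengths agree, though the NaN values inside each row would still need to be handled before computing the cosine.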
This is a complicated question, so apologies in advance if it is unclear. Any help is much appreciated.
import pandas as pd
from scipy.spatial.distance import cosine

def cosine_similarity(a, b):
    # pair the two vectors up and keep only positions where both are non-null
    matrix = pd.DataFrame({"A": a, "B": b})
    matrix = matrix.dropna(axis=0, how='any')
    # select 1-D Series (not one-column DataFrames); scipy's cosine expects 1-D inputs
    a = matrix['A']
    b = matrix['B']
    # scipy's cosine returns a distance, so the similarity is 1 - distance
    return 1 - cosine(a, b)

def sims(index):
    # find_similar_rows finds all times within a minutes threshold of the index row;
    # not necessary to know for this question, but it gives some context
    similar_rows = find_similar_rows(index)
    similar_rows = similar_rows.filter(like='to')
    # the query row's values, as a list
    query_cols = list(similar_rows.loc[index])
    # drop the query itself so it is not compared with itself
    similar_rows = similar_rows.drop([index], axis=0)
    # similarity between each remaining row and the query, over shared non-null columns
    result = similar_rows.apply(lambda row: cosine_similarity(list(row), query_cols), axis=1)
    result = result.sort_values(ascending=False)  # most similar first
    result = result.head(10)
    return result
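The same idea, keeping only the positions where both rows are non-null, can also be sketched with a NumPy boolean mask and no intermediate DataFrame per row. This is an alternative illustration, not the code above: `masked_cosine_similarity` is a hypothetical name, and the frame is the small example from the question, not the output of `find_similar_rows`.

```python
import numpy as np
import pandas as pd

def masked_cosine_similarity(a, b):
    """Cosine similarity over positions where both inputs are non-NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.nan  # no shared columns, similarity is undefined
    a, b = a[mask], b[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else np.nan

similar_rows = pd.DataFrame({
    "A": [2, 3, np.nan],
    "B": [3, np.nan, 4],
    "C": [np.nan, 2, 1],
    "D": [3, 1, 3],
    "E": [1, np.nan, np.nan],
    "F": [np.nan, 4, 5],
})

index = 1
query = similar_rows.loc[index]
others = similar_rows.drop(index=index)

# axis=1 so that each call receives one row, not one column
result = others.apply(lambda row: masked_cosine_similarity(row, query), axis=1)
result = result.sort_values(ascending=False)
print(result.round(4).to_dict())  # {2: 0.9221, 0: 0.7894}
```

Returning NaN for rows that share no columns with the query is a design choice made for this sketch; dropping such rows or treating them as similarity 0 may suit other uses better.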