Function to find cosine similarity between rows using non-null common columns only

Question

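I want to write a function that finds the cosine similarity between an index row (the query) and every other row in a DataFrame, using only the columns they have in common. The problem is that the set of common non-null columns can differ from one pair of rows to the next. I tried replacing the null values with 0, as was suggested when I asked a similar question before, but that is not the output or approach I am looking for, so I am trying to be more specific here. For example, my query looks like this: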

     A    B    C    D    E    F
1    3    NaN  2    1    NaN  4

It is contained in the similar_rows DataFrame:

     A    B    C    D    E    F
0    2    3    NaN  3    1    NaN
1    3    NaN  2    1    NaN  4
2    NaN  4    1    3    NaN  5

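So the cosine similarity should be computed between the query (index 1 in this example) and rows 0 and 2 separately, each time using only the columns that are non-null in both rows. For instance, the cosine similarity between rows 0 and 1 should use only columns A and D, because those are the only columns where both rows are non-null.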
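To make the expected result concrete, here is a minimal sketch of that single comparison (row 0 against the query, row 1) on the sample data above. It assumes NumPy and pandas; the variable names `query`, `other`, and `common` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question
similar_rows = pd.DataFrame(
    {
        "A": [2, 3, np.nan],
        "B": [3, np.nan, 4],
        "C": [np.nan, 2, 1],
        "D": [3, 1, 3],
        "E": [1, np.nan, np.nan],
        "F": [np.nan, 4, 5],
    }
)

query = similar_rows.loc[1]  # the query row
other = similar_rows.loc[0]  # the row to compare against

# Boolean mask of the columns that are non-null in BOTH rows
common = query.notna() & other.notna()
common_cols = list(common[common].index)  # ['A', 'D']

# Cosine similarity restricted to the shared columns
q = query[common].to_numpy(dtype=float)
o = other[common].to_numpy(dtype=float)
similarity = float(np.dot(q, o) / (np.linalg.norm(q) * np.linalg.norm(o)))
print(common_cols, round(similarity, 3))  # ['A', 'D'] 0.789
```

Only A and D survive the mask, so the vectors compared are (3, 1) and (2, 3), giving 9 / sqrt(130).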

So far, my function looks like this:

def sims(index):
    # find_similar_rows finds all rows whose times are within a minutes threshold of the index row; not necessary for this question, just giving some context
    similar_rows = find_similar_rows(index)
    # get the query row
    query_cols = similar_rows.loc[index]
    # get the names of the query's non-null columns as a list
    q_cols_names = query_cols[query_cols.notnull()]
    q_cols_names = list(q_cols_names.index)
    # put the query into its own DataFrame
    qs = pd.DataFrame(query_cols[q_cols_names])
    qs = qs.T

    # this is where the error occurs; I am not sure why
    result = similar_rows[q_cols_names].apply(lambda row: cosine(row, qs))

    return result
    # the error says: 'shapes (33,) and (24,) not aligned: 33 (dim 0) != 24 (dim 0)' (obviously my actual DataFrame is different from the one above). I am not sure what this error is telling me.

This is a complicated question, so apologies in advance if it is unclear. Any help is much appreciated.

Answer 1

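import pandas as pd
from scipy.spatial.distance import cosine  # assumed source of `cosine`; the post does not show its imports

def cosine_similarity(a, b):
    # pair the two rows up as columns so dropna removes every position where either one is null
    matrix = pd.DataFrame({"A": a, "B": b})
    matrix = matrix.dropna(axis=0, how='any')
    # select 1-D Series rather than single-column DataFrames, since cosine expects 1-D vectors
    a = matrix['A']
    b = matrix['B']
    # cosine returns the cosine distance, so similarity = 1 - distance
    return 1 - cosine(a, b)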

def sims(index):
    # find_similar_rows (defined elsewhere by the asker) finds all rows whose times are within a minutes threshold of the index row
    similar_rows = find_similar_rows(index)
    # keep only the columns whose names contain 'to' (presumably specific to the poster's actual data)
    similar_rows = similar_rows.filter(like='to')
    # values of the query row, as a plain list
    query_cols = list(similar_rows.loc[index])
    # drop the query row so it is not compared with itself
    similar_rows = similar_rows.drop([index], axis=0)
    # axis=1 applies the function to each row; positions where either row is null are dropped inside cosine_similarity
    result = similar_rows.apply(lambda row: cosine_similarity(list(row), query_cols), axis=1)
    result = result.sort_values(ascending=False)  # most similar first
    result = result.head(10)  # keep the ten most similar rows

    return result
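
The same idea can be tried on the sample data from the question. The sketch below is an illustration under stated assumptions, not the poster's exact code: it takes the DataFrame as an argument instead of calling `find_similar_rows`, skips the `filter(like='to')` step, and computes the cosine with NumPy rather than scipy's `cosine`. The names `masked_cosine_similarity` and `sims_from_frame` are hypothetical. The error in the question appears consistent with `apply` being called without `axis=1`: by default `apply` passes each column (33 values in the asker's data) rather than each row, and that column is then compared with the 24 non-null query values.

```python
import numpy as np
import pandas as pd


def masked_cosine_similarity(row, query):
    """Cosine similarity over the columns where both Series are non-null."""
    mask = row.notna() & query.notna()
    a = row[mask].to_numpy(dtype=float)
    b = query[mask].to_numpy(dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # With no shared columns (or a zero vector) the similarity is undefined
    return float(np.dot(a, b) / denom) if denom > 0 else np.nan


def sims_from_frame(similar_rows, index, top_n=10):
    """Rank every other row by its similarity to the row at `index`."""
    query = similar_rows.loc[index]
    others = similar_rows.drop(index)  # do not compare the query with itself
    # axis=1 hands each ROW to the lambda; the default axis=0 would pass columns
    scores = others.apply(lambda row: masked_cosine_similarity(row, query), axis=1)
    return scores.sort_values(ascending=False).head(top_n)


# Sample data from the question
similar_rows = pd.DataFrame(
    {
        "A": [2, 3, np.nan],
        "B": [3, np.nan, 4],
        "C": [np.nan, 2, 1],
        "D": [3, 1, 3],
        "E": [1, np.nan, np.nan],
        "F": [np.nan, 4, 5],
    }
)

result = sims_from_frame(similar_rows, 1)
print(result.round(3))  # row 2 scores 0.922, row 0 scores 0.789
```

Row 2 shares columns C, D, and F with the query, while row 0 shares only A and D, so each score is computed over a different set of columns, which is what the question asks for.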

