Get Jaccard Similarity by Comparing All Rows in a Pandas DataFrame While Keeping Track of Rows Being Compared
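Hi, I would like to get the Jaccard similarity between all rows in a dataframe.
I already have a Jaccard similarity function like the one below, which takes in two lists, but I can't work out how to keep track of the users being compared.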
def jaccard_similarity(x, y):
    """ returns the jaccard similarity between two lists """
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality/float(union_cardinality)
I would like to run this function against all the rows in the dataframe.
wordings | users |
---|---|
apple,banana,orange,pears | adeline |
banana,jackfruit,berries,apple | ericko |
berries,grapes,watermelon | mary |
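How can I generate an output like the one below, so that I can keep track of the users being compared?
user1 | user2 | similarity |
---|---|---|
adeline | ericko | 0.5 |
adeline | mary | 0.2 |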
Thank you very much for your guidance.
sentences = ['apple,banana,orange,pears', 'banana,jackfruit,berries,apple']
sentences = [sent.lower().split(",") for sent in sentences]
jaccard_similarity(sentences[0], sentences[1])
Output: 0.3333333333333333
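For reference, that value follows directly from the set definition of Jaccard similarity. The worked computation below is only a sketch of the arithmetic for the two example lists; the symbols A and B are labels introduced here for the first and second list.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|\{\text{apple}, \text{banana}\}|}
               {|\{\text{apple}, \text{banana}, \text{orange}, \text{pears}, \text{jackfruit}, \text{berries}\}|}
        = \frac{2}{6} \approx 0.333
```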
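Running the code above gives me the value I want, but I am stuck on how to keep track of the users being compared in the dataframe when I have, say, 100 rows of data.
Thanks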
A possible solution is as follows:
import itertools
import pandas as pd

# copied from OP above
def jaccard_similarity(x, y):
    """ returns the jaccard similarity between two lists """
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality/float(union_cardinality)

# set initial data and create dataframe
data = {"wordings": ["apple,banana,orange,pears", "banana,jackfruit,berries,apple", "berries,grapes,watermelon"], "users": ["adeline", "ericko", "mary"]}
df = pd.DataFrame(data)

# split each wording into a list of words; passing the raw string would compare characters, not words
wordings = [wording.lower().split(",") for wording in df["wordings"]]

# create list of tuples like [(words, user), (words, user)]
wordings_users = list(zip(wordings, df["users"]))

result = []
# loop through all possible pairs of (words, user) tuples
for (words1, user1), (words2, user2) in itertools.combinations(wordings_users, 2):
    similarity = jaccard_similarity(words1, words2)
    result.append({"user1": user1, "user2": user2, "similarity": similarity})

df1 = pd.DataFrame(result)
df1

Returns

user1 | user2 | similarity |
---|---|---|
adeline | ericko | 0.333333 |
adeline | mary | 0.000000 |
ericko | mary | 0.166667 |
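The same idea can be wrapped in a reusable function. The sketch below is one way to do it, not the only one: `pairwise_jaccard` is a hypothetical helper name, it assumes the column names `wordings` and `users` from the question, and it makes the additional choice (not in the original function) of returning 0.0 when both word sets are empty instead of raising `ZeroDivisionError`.

```python
from itertools import combinations

import pandas as pd


def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per unordered pair of users with their similarity."""
    # Build each row's word set once, paired with the user it belongs to.
    labelled = [
        (user, {w.strip().lower() for w in wording.split(",") if w.strip()})
        for user, wording in zip(df["users"], df["wordings"])
    ]
    # combinations() yields every unordered pair exactly once, in row order.
    rows = [
        {"user1": u1, "user2": u2, "similarity": jaccard_similarity(s1, s2)}
        for (u1, s1), (u2, s2) in combinations(labelled, 2)
    ]
    return pd.DataFrame(rows, columns=["user1", "user2", "similarity"])


df = pd.DataFrame({
    "wordings": [
        "apple,banana,orange,pears",
        "banana,jackfruit,berries,apple",
        "berries,grapes,watermelon",
    ],
    "users": ["adeline", "ericko", "mary"],
})
pairs = pairwise_jaccard(df)
print(pairs)
```

Because every unordered pair is visited once, a dataframe with 100 rows produces 100 * 99 / 2 = 4950 result rows, each labelled with the two users it compares.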