How to calculate the pairwise Jaccard similarity score for every row in a DataFrame using Python
I have the following DataFrame:
import pandas as pd
df=pd.DataFrame.from_dict({"q1":['What is the step by step guide to invest in share market in india?',
'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
'How can I increase the speed of my internet connection while using a VPN?',
'Why am I mentally very lonely? How can I solve it?',
'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?'],
"q2":['What is the step by step guide to invest in share market?',
'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
'How can Internet speed be increased by hacking through DNS?',
'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',
'Which fish would survive in salt water?']})
df
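I am trying to compute the Jaccard similarity score between each pair of sentences in the q1 and q2 columns, row by row (using map, a list comprehension, or apply), and store the result in a new column, jac_q1_q2.
For a single row, it can be done like this: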
import nltk
jd_sent_1_2 = nltk.jaccard_distance(set(df['q1'][0]), set(df['q2'][0]))
jd_sent_1_2
>0.0
Thanks.
This can be done with apply() and a lambda function:
scores = df.apply(lambda row: nltk.jaccard_distance(set(row['q1']), set(row['q2'])), axis=1)
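You can also use a list comprehension (nltk.jaccard_distance expects sets, so each string is wrapped in set()):
df['jac_q1_q2'] = [nltk.jaccard_distance(set(text1), set(text2)) for text1, text2 in zip(df['q1'], df['q2'])]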
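Two details are worth keeping in mind with the snippets above: set() applied to a string gives a set of characters rather than words, which is why the first pair scores 0.0, and nltk.jaccard_distance returns a distance, so the similarity is 1 minus that value. Below is a minimal sketch of a word-level similarity in plain Python. The helper name jaccard_similarity and the lowercase, whitespace-split tokenization are assumptions made for illustration, not part of the original question or answers.

```python
def jaccard_similarity(text1, text2):
    """Word-level Jaccard similarity: |A & B| / |A | B|."""
    # Assumption: tokens are lowercase, whitespace-separated words;
    # punctuation stays attached to the word it follows.
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    union = words1 | words2
    if not union:
        # Convention chosen here: two empty texts count as identical.
        return 1.0
    return len(words1 & words2) / len(union)


# Two of the question's sentence pairs, kept as plain lists.
q1 = ['What is the step by step guide to invest in share market in india?',
      'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?']
q2 = ['What is the step by step guide to invest in share market?',
      'Which fish would survive in salt water?']

# One score per row, pairing the i-th entry of q1 with the i-th entry of q2.
scores = [jaccard_similarity(a, b) for a, b in zip(q1, q2)]
print(scores)
```

With the DataFrame from the question, the same comprehension over zip(df['q1'], df['q2']) can be assigned to df['jac_q1_q2'] to create the new column.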