How to group text data based on document similarity?
Consider the following dataframe
df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})
                     Questions
0          What are you doing?
1  What are you doing tonight?
2      What are you doing now?
3           What is your name?
4      What is your nick name?
5      What is your full name?
6               Shall we meet?
7           How are you doing?
How do I group the dataframe by similar questions? i.e. how do I obtain groups like the ones shown below?
for _, i in df.groupby('similarity')['Questions']:
    print(i, '\n')
6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object
A similar question was asked here, but it was not very clear, so it went unanswered.
Here is a fairly heavyweight approach: it computes the normalized similarity score between every pair of elements in the series, and then groups the rows on the newly obtained similarity lists converted to strings. i.e.
import nltk
import pandas as pd
from nltk.corpus import wordnet as wn

def convert_tag(tag):
    """Convert a Penn Treebank POS tag to a WordNet POS tag."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.
    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    synsetlist = []
    tokens = nltk.word_tokenize(doc)
    pos = nltk.pos_tag(tokens)
    for token, tag in pos:
        synsets = wn.synsets(token, convert_tag(tag))
        if synsets:  # skip words that have no synset
            synsetlist.append(synsets[0])
    return synsetlist

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2.
    For each synset in s1, finds the synset in s2 with the largest
    similarity value. Sums all of the largest similarity values and
    normalizes by the number of largest similarity values found.

    Args:
        s1, s2: lists of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    highscores = []
    for synset1 in s1:
        highest_yet = 0
        for synset2 in s2:
            simscore = synset1.path_similarity(synset2)
            # path_similarity returns None for incomparable synsets
            if simscore is not None and simscore > highest_yet:
                highest_yet = simscore
        if highest_yet > 0:
            highscores.append(highest_yet)
    return sum(highscores) / len(highscores) if highscores else 0

def document_path_similarity(doc1, doc2):
    """Symmetric document similarity: average of the two one-way scores."""
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2)
            + similarity_score(synsets2, synsets1)) / 2

def similarity(x, df):
    """Similarity scores of question x against every question in df."""
    return [document_path_similarity(x, i) for i in df['Questions']]
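Note that doc_to_synsets relies on a few NLTK data packages; if they are not already installed, a one-time download is needed (exact resource names may vary slightly across NLTK versions):

import nltk
nltk.download('punkt')                        # tokenizer models for word_tokenize
nltk.download('averaged_perceptron_tagger')   # tagger for pos_tag
nltk.download('wordnet')                      # WordNet corpus for wn.synsets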
With the methods defined above, we can now do
df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)

for _, i in df.groupby('similarity')['Questions']:
    print(i, '\n')
Output:
6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object
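As a side note, if the long stringified score lists are awkward to keep around, pandas can map each distinct similarity vector to an integer label instead (a small optional tweak, not part of the original answer):

df['group'] = df.groupby('similarity').ngroup()  # integer label per group
print(df[['Questions', 'group']])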
This is not the best way to solve the problem, and it is very slow. Any new approaches would be highly appreciated.
You should first sort all the names in the list/dataframe column, and then run the similarity code on only n-1 rows, i.e. for each row, compare it with the next element. If the two are similar, you can classify them as 1 or 0 and parse through the list that way, instead of comparing each row against all the other elements, which is n^2. A sketch of this idea follows below.
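A minimal sketch of that idea, reusing document_path_similarity from the answer above. The 0.5 threshold is an assumption for illustration only, and note that this only recovers groups whose members happen to end up adjacent after sorting:

# Sort the questions so similar ones (hopefully) sit next to each other,
# then compare each question only with its successor: n-1 comparisons.
qs = df['Questions'].sort_values()
order = qs.index            # original row labels, in sorted order
texts = qs.tolist()

group_ids = [0]             # first sorted question opens group 0
for prev, curr in zip(texts, texts[1:]):
    # assumed threshold: treat a pair as "similar" above 0.5
    same_group = document_path_similarity(prev, curr) > 0.5
    group_ids.append(group_ids[-1] if same_group else group_ids[-1] + 1)

df.loc[order, 'group'] = group_ids
for _, g in df.groupby('group')['Questions']:
    print(g, '\n')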