What should be given as input to the linkage function: the tfidf matrix, or the similarity between different elements of the tfidf matrix?
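I have the following Python notebook, which aims to cluster different groups of abstracts based on the similarity between their texts.
I have two approaches here: one is to pass the tfidf array of the documents to the linkage function, and the other is to compute the similarity between the tfidf vectors of the different documents and then use that similarity matrix for clustering. I cannot work out which one is correct.
Approach 1:
I used cosine_similarity to compute the (cosine) similarity matrix of the tfidf matrix. I first converted the redundant square matrix (cosine) into a condensed distance matrix (distance_matrix) using the squareform function. Then distance_matrix was fed into the linkage function, and I plotted the result as a dendrogram.
Approach 2:
I passed the dense form of the tfidf array (via todense()) to the linkage function and plotted the dendrogram.
My question is: which one is correct? As far as I can tell from the data, approach 2 appears to give the correct result, but approach 1 is the one that makes sense to me. It would be great if someone could explain which is right in this scenario. Thanks in advance.
Let me know if anything in the question is unclear.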
import pandas, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
###Data Cleaning
stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
df=pandas.read_csv('WIPO_CSV.csv')
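import sys
# Python 2 only: on Python 3 the reload builtin and sys.setdefaultencoding are gone, so drop these two lines there.
reload(sys)
sys.setdefaultencoding('utf8')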
documents_no_stopwords=[]
def preprocessing(word):
    # Tokenize one document and keep only the tokens that are not stop words
    tokens = tokenizer.tokenize(word)
    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            processed_words.append(w)
    # This step creates a list of text documents with the stop words removed
    documents_no_stopwords.append(' '.join(processed_words))
for text in df['TEXT'].tolist():
    preprocessing(text)
# Converting into tfidf form
# latin1 is used because the utf8 decoder was having trouble with the text.
vectoriser = TfidfVectorizer(encoding='latin1')
# The result is a SciPy sparse matrix (not a numpy array) whose rows are L2-normalised by default
tfidf_documents = vectoriser.fit_transform(documents_no_stopwords)
##Cosine Similarity as the input to linkage should be a distance vector
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform
cosine = cosine_similarity(tfidf_documents)
distance_matrix = squareform(cosine,force='tovector',checks=False)
from scipy.cluster.hierarchy import dendrogram, linkage
##Linkage based on tfidf of each document
z_num=linkage(tfidf_documents.todense(),'ward')
z_num #tfidf
array([[11. , 12. , 0. , 2. ],
[18. , 19. , 0. , 2. ],
[20. , 31. , 0. , 3. ],
[21. , 32. , 0. , 4. ],
[22. , 33. , 0. , 5. ],
[17. , 34. , 0.38208619, 6. ],
[15. , 28. , 1.19375843, 2. ],
[ 6. , 9. , 1.24241258, 2. ],
[ 7. , 8. , 1.27069483, 2. ],
[13. , 37. , 1.28868301, 3. ],
[ 4. , 24. , 1.30850122, 2. ],
[36. , 39. , 1.32090275, 5. ],
[10. , 16. , 1.32602346, 2. ],
[27. , 38. , 1.32934025, 3. ],
[23. , 25. , 1.32987072, 2. ],
[ 3. , 29. , 1.35143582, 2. ],
[ 5. , 14. , 1.35401753, 2. ],
[26. , 42. , 1.35994878, 3. ],
[ 2. , 45. , 1.40055438, 3. ],
[ 0. , 40. , 1.40811825, 3. ],
[ 1. , 46. , 1.41383622, 3. ],
[44. , 50. , 1.4379821 , 5. ],
[41. , 43. , 1.44575227, 8. ],
[48. , 51. , 1.45876241, 8. ],
[49. , 53. , 1.47130328, 11. ],
[47. , 52. , 1.49944936, 11. ],
[54. , 55. , 1.69814818, 22. ],
[30. , 56. , 1.91299937, 24. ],
[35. , 57. , 3.1967033 , 30. ]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_num)
plt.show()
##Linkage based on similarity
z_sim=linkage(distance_matrix,'ward')
z_sim #cosine similarity
array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[2.00000000e+00, 3.00000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.70000000e+01, 3.10000000e+01, 0.00000000e+00, 4.00000000e+00],
[3.00000000e+00, 4.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[1.00000000e+01, 3.30000000e+01, 0.00000000e+00, 3.00000000e+00],
[5.00000000e+00, 7.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[6.00000000e+00, 1.80000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.10000000e+01, 1.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.20000000e+01, 2.00000000e+01, 0.00000000e+00, 2.00000000e+00],
[8.00000000e+00, 2.40000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.60000000e+01, 2.10000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.20000000e+01, 2.70000000e+01, 0.00000000e+00, 2.00000000e+00],
[9.00000000e+00, 2.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.60000000e+01, 4.20000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.40000000e+01, 3.40000000e+01, 3.97089886e-03, 4.00000000e+00],
[2.30000000e+01, 4.40000000e+01, 1.81733052e-02, 5.00000000e+00],
[3.20000000e+01, 3.50000000e+01, 2.14592323e-02, 6.00000000e+00],
[2.50000000e+01, 4.00000000e+01, 2.84944415e-02, 3.00000000e+00],
[1.30000000e+01, 4.70000000e+01, 5.02045376e-02, 4.00000000e+00],
[4.10000000e+01, 4.30000000e+01, 5.10902795e-02, 5.00000000e+00],
[3.70000000e+01, 4.50000000e+01, 5.40176402e-02, 7.00000000e+00],
[3.80000000e+01, 3.90000000e+01, 6.15118462e-02, 4.00000000e+00],
[1.50000000e+01, 4.60000000e+01, 7.54874869e-02, 7.00000000e+00],
[2.80000000e+01, 5.00000000e+01, 9.55487454e-02, 8.00000000e+00],
[5.20000000e+01, 5.30000000e+01, 3.86911095e-01, 1.50000000e+01],
[4.90000000e+01, 5.40000000e+01, 4.16693529e-01, 2.00000000e+01],
[4.80000000e+01, 5.50000000e+01, 4.58764920e-01, 2.40000000e+01],
[3.60000000e+01, 5.60000000e+01, 5.23422380e-01, 2.60000000e+01],
[5.10000000e+01, 5.70000000e+01, 5.49419077e-01, 3.00000000e+01]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_sim)
plt.show()
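The accuracy of the clustering is compared against this picture: https://drive.google.com/file/d/1EgkPqwh7AKhGqOe1zf9KNjSMxPQ9Xfd9/view?usp=sharing
The dendrograms I got are available in the following notebook link: https://drive.google.com/file/d/1TB7aFK4lPDo43GY74FPOqVOx1AxWV-A_/view?usp=sharing
Open this html file in a web browser.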
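SciPy only supports distances for HAC, not similarities. A 1-D input to linkage is interpreted as a condensed distance matrix, so a cosine similarity has to be converted into a distance before it is passed in.
Once both approaches use the same distance, the results should be the same. So there is no "right" or "wrong".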
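To make the similarity-versus-distance point concrete, here is a minimal sketch of one common conversion, distance = 1 - cosine similarity. The toy matrix `X` and all variable names are made up for illustration and are not taken from the question's data; the similarity is computed with plain NumPy to keep the sketch self-contained.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for a TF-IDF matrix: 4 documents, 3 terms (made-up values).
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.8, 0.6],
])

# Cosine similarity: scale rows to unit length, then take dot products.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
similarity = X_unit @ X_unit.T

# Turn similarity into distance, so identical documents get 0 instead of 1.
distance = 1.0 - similarity
distance = (distance + distance.T) / 2.0  # enforce exact symmetry
distance = np.clip(distance, 0.0, None)   # guard against tiny negative values
np.fill_diagonal(distance, 0.0)           # remove rounding noise on the diagonal

# Condensed (upper-triangle) vector: the 1-D form that linkage accepts.
condensed = squareform(distance)
```

In the question's code, squareform is applied to the similarity matrix itself with checks=False, so the values handed to linkage are large for similar documents, which is the opposite of what a distance should be.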
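At some point you need the distance matrix in linearized (condensed) form. It is probably most efficient to use a method that a) can process sparse data (avoiding any todense calls) and b) directly produces the linearized form, rather than producing the whole matrix and then discarding half of it.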
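As a sketch of point (b), scipy.spatial.distance.pdist returns the condensed vector directly, and passing it to linkage gives the same merge table as passing the raw vectors with the same metric. The random data and variable names below are assumptions for illustration only. Note that pdist expects a dense array, so this does not cover point (a). The sketch uses 'average' linkage because the SciPy documentation states that 'ward' is only correctly defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy dense data standing in for document vectors (8 documents, 5 terms).
rng = np.random.default_rng(0)
X = rng.random((8, 5))

# Condensed cosine distances, produced without building the square matrix.
condensed = pdist(X, metric='cosine')  # length n*(n-1)/2 = 28

# Same metric, two input forms: raw vectors vs. precomputed distances.
z_from_vectors = linkage(X, method='average', metric='cosine')
z_from_distances = linkage(condensed, method='average')

# The two linkage matrices agree.
same = np.allclose(z_from_vectors, z_from_distances)
```

When the rows are L2-normalised, as TfidfVectorizer produces by default, the squared Euclidean distance equals 2 * (1 - cosine similarity), so Euclidean and cosine distances rank document pairs in the same order.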