How to prepare pandas string data table for sklearn clustering algorithm?
Group | Role | User | Occurrences |
---|---|---|---|
GUS | DEFAULT_M | Pie | 47251 |
RSS | DEFAULT_R | Pie | 27057 |
RRD | DEFAULT_M | Danat | 21251 |
NOTES | DEFAULT_R | Boni | 17933 |
GTS | DEFAULT_Q | Boni | 16067 |
I have about 5000 rows of data like the above, and I'm trying to build a clustering algorithm to learn which users belong to which group; it would produce clusters of groups together with their users. When I tried to do this with the sklearn library, unfortunately it told me the data needs to be int or float — it can't compute distances between these strings. Is there a way I can still use sklearn's k-means algorithm on this string DataFrame to cluster user groups? The alternative is to convert the groups and users to numbers, which would take a long time, and I'd need to keep dictionaries for the groups and users. If I go that route, is there an easier way to convert groups and users to numbers so the clustering algorithm can interpret them? Thanks in advance for your help.
As far as I know, every clustering algorithm works on numbers — either you feed it numbers, or it converts the text to numbers and then does its job. Maybe you can try this.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # third-party package: pip install distance

words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',')  # Replace this line with your own strings
words = np.asarray(words)  # so that indexing with an array will work

# Pairwise negated Levenshtein distances: values closer to 0 mean more similar
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
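For the tabular Group/Role/User data in the question, you don't necessarily have to hand-build dictionaries at all: one-hot encoding the string columns with pandas gives k-means the numeric input it needs. A minimal sketch (the column names and toy values are made up to mirror the question's table):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame mirroring the question's columns (illustrative values)
df = pd.DataFrame({
    "group": ["GUS", "RSS", "RRD", "NOTES", "GTS"],
    "role":  ["DEFAULT_M", "DEFAULT_R", "DEFAULT_M", "DEFAULT_R", "DEFAULT_Q"],
    "user":  ["Pie", "Pie", "Danat", "Boni", "Boni"],
})

# One-hot encode the string columns so k-means gets a numeric matrix:
# one indicator column per distinct group/role/user value
X = pd.get_dummies(df[["group", "role", "user"]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_
print(df)
```

With real data you would weight the `Occurrences` column in as well (e.g. as sample weights or an extra scaled feature), since row frequency is the only non-categorical signal in the table.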
Or...
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Turn each document into a TF-IDF vector, then cluster the vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in sklearn 1.2
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

print("Prediction")
Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)
Result... Top terms per cluster:
Cluster 0:
kitten
belly
squooshy
merley
best
eating
google
feedback
face
extension
Cluster 1:
impressed
map
feedback
google
ve
eating
face
extension
climbing
key
Cluster 2:
climbing
ninja
cat
eating
impressed
google
feedback
face
extension
ve
Cluster 3:
eating
kitty
little
came
restaurant
play
ve
feedback
face
extension
Cluster 4:
100
open
tab
smiley
face
google
feedback
extension
eating
climbing
Cluster 5:
chrome
extension
promoter
key
google
eating
impressed
feedback
face
ve
Cluster 6:
translate
app
incredible
google
eating
impressed
feedback
face
extension
ve
Cluster 7:
ve
taken
photo
best
cat
eating
google
feedback
face
extension
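And if you do want plain integer codes plus the lookup dictionary the question mentions, pandas `factorize` builds both in one call, so the conversion is a one-liner rather than a long manual job. A minimal sketch with made-up user names:

```python
import pandas as pd

users = pd.Series(["Pie", "Pie", "Danat", "Boni", "Boni"])

# factorize returns integer codes plus the unique values
# in their order of first appearance
codes, uniques = pd.factorize(users)
print(codes)                      # [0 0 1 2 2]
print(dict(enumerate(uniques)))   # {0: 'Pie', 1: 'Danat', 2: 'Boni'}
```

Applied per column, this gives you a numeric DataFrame for clustering and keeps `uniques` as the dictionary for mapping cluster results back to the original names.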