Apply KNN from small supervised dataset to large unsupervised dataset in Python
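I have trained and tested a KNN model on a small supervised dataset of about 200 samples in Python. I would like to apply these results to a much larger unsupervised dataset of several thousand samples.
My question is: is there a way to fit the KNN model on the small supervised dataset and then change the K value for the large unsupervised dataset? I do not want to overfit the model by using a low K value from the smaller dataset, but I am unsure how to fit the model and then change the value of K in Python.
Is this possible using KNN? Is there some other way to apply KNN to a much larger unsupervised dataset?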
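In machine learning there are two broad classes of learners: eager learners (decision trees, neural networks, SVMs, ...) and lazy learners such as KNN. In fact, KNN does not do any learning at all. It simply stores the labeled data you have and uses it at inference time, computing how similar a new (unlabeled) sample is to every sample in the stored (labeled) data. It then infers the new sample's class or value from its K nearest instances (the K nearest neighbors, hence the name): a majority vote for a class, or an average for a numeric value.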
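To make the "lazy" behaviour concrete, here is a minimal from-scratch sketch. The class name `LazyKNN`, the toy data, and the choice of Euclidean distance are assumptions made for illustration; this is not scikit-learn's implementation. Note that `fit` only stores the data and that `k` first appears at prediction time.

```python
from collections import Counter

import numpy as np


class LazyKNN:
    """Toy k-nearest-neighbours classifier (illustrative only)."""

    def fit(self, X, y):
        # "Training" just memorises the labeled data; k plays no role here.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X_new, k):
        X_new = np.asarray(X_new, dtype=float)
        # Euclidean distance from every new sample to every stored sample.
        dists = np.linalg.norm(X_new[:, None, :] - self.X_[None, :, :], axis=2)
        # Indices of the k closest stored samples for each new sample.
        nearest = np.argsort(dists, axis=1)[:, :k]
        # Majority vote among the labels of those neighbours.
        return np.array(
            [Counter(self.y_[row]).most_common(1)[0][0] for row in nearest]
        )


# Hypothetical toy data: two well-separated groups.
X_labeled = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y_labeled = ["a", "a", "a", "b", "b", "b"]

model = LazyKNN().fit(X_labeled, y_labeled)
X_unlabeled = [[0.5, 0.5], [5.5, 5.5]]

# The same stored data can be queried with different values of k.
pred_k1 = model.predict(X_unlabeled, k=1)
pred_k3 = model.predict(X_unlabeled, k=3)
print(pred_k1, pred_k3)
```

Because nothing is learned in `fit`, trying another `k` costs nothing beyond another call to `predict`.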
Now, to answer your question: "training" KNN has nothing to do with K itself, so at inference time you are free to use whichever K gives you the best results.
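One way to decide which K "gives the best results" is to cross-validate on the small labeled set and only then label the large set. The sketch below assumes scikit-learn is available and uses synthetic data from `make_blobs` as a stand-in for the real datasets; the variable names and the list of candidate values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins (assumption): 200 labeled samples, 3000 unlabeled ones.
X_all, y_all = make_blobs(n_samples=3200, centers=3, cluster_std=2.0, random_state=0)
X_small, y_small = X_all[:200], y_all[:200]
X_large = X_all[200:]  # the labels of these samples are deliberately ignored

candidate_ks = [1, 3, 5, 7, 9, 15, 25]
mean_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy, using the labeled data only.
    scores = cross_val_score(knn, X_small, y_small, cv=5)
    mean_scores.append(scores.mean())

best_k = candidate_ks[int(np.argmax(mean_scores))]

# Refit on all labeled samples with the chosen k, then label the large set.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_small, y_small)
predicted_labels = final_knn.predict(X_large)
print(best_k, predicted_labels.shape)
```

Refitting for each K is cheap here because fitting KNN only stores (and possibly indexes) the 200 labeled samples. Held-out accuracy also speaks to the overfitting concern directly: a K that is too low tends to score worse on the validation folds.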
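I would recommend actually fitting the KNN model on the larger dataset a couple of times, each time using a different value for k. For each of those models you can compute the Silhouette Score.
Compare the various silhouette scores, and for your final value of k (the number of clusters) choose the value you used for the highest-scoring model.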
As an example, here is some code I wrote for myself last year:
import numpy as np
import matplotlib.pyplot as plt
# mixture.GMM was removed from scikit-learn; GaussianMixture is its replacement.
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

## A list of the different numbers of clusters (the 'n_components' parameter) with
## which we will run GMM.
number_of_clusters = list(range(2, 21))

## Graph plotting method
def makePlot(number_of_clusters, silhouette_scores):
    # Plot each value of 'number of clusters' vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.set_xlabel('GMM - number of clusters')
    ax.set_ylabel('Silhouette Score (higher is better)')
    ax.plot(number_of_clusters, silhouette_scores)

    # Ticks and grid
    xticks = np.arange(min(number_of_clusters), max(number_of_clusters) + 1, 1.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(silhouette_scores), 2), max(silhouette_scores), .02)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')
    plt.show()

## Graph the mean silhouette score of each cluster amount.
## Print out the number of clusters that results in the highest
## silhouette score for GMM.
## X is your data set: an array of shape (n_samples, n_features).
def findBestClusterer(number_of_clusters, X):
    silhouette_scores = []
    for i in number_of_clusters:
        clusterer = GaussianMixture(n_components=i)  # Use the model of your choice here
        clusterer.fit(X)
        preds = clusterer.predict(X)
        score = silhouette_score(X, preds)
        silhouette_scores.append(score)

    ## Print a table of all the silhouette scores
    print("")
    print("| Number of clusters | Silhouette score |")
    print("| ------------------ | ---------------- |")
    for n, score in zip(number_of_clusters, silhouette_scores):
        ## Fixed column widths keep the table aligned for one- and two-digit
        ## cluster counts alike.
        print("| {number:>18} | {score:>16.4f} |".format(number=n, score=score))

    ## Graph the plot of silhouette scores for each amount of clusters
    makePlot(number_of_clusters, silhouette_scores)

    ## Find and print out the cluster amount that gives the highest
    ## silhouette score.
    best_silhouette_score = max(silhouette_scores)
    index_of_best_score = silhouette_scores.index(best_silhouette_score)
    ideal_number_of_clusters = number_of_clusters[index_of_best_score]
    print("")
    print("Having {} clusters gives the highest silhouette score of {}.".format(
        ideal_number_of_clusters, round(best_silhouette_score, 4)))

## Replace X with the variable that holds your data set.
findBestClusterer(number_of_clusters, X)
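Note that in my example I used a GMM model rather than KNN, but you should be able to modify the findBestClusterer() method slightly to use whatever clustering algorithm you wish. That method is also where you specify your data set.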
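As a hedged illustration of that kind of swap, the sketch below runs the same silhouette comparison with `KMeans` in place of the GMM. The helper name `silhouette_by_cluster_count` and the synthetic `make_blobs` data are assumptions made for the example; they are not part of the code above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the real, unlabeled data set (assumption).
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)


def silhouette_by_cluster_count(X, cluster_counts):
    """Return {number of clusters: silhouette score} for KMeans on X."""
    scores = {}
    for n in cluster_counts:
        clusterer = KMeans(n_clusters=n, n_init=10, random_state=0)
        labels = clusterer.fit_predict(X)  # fit and assign clusters in one step
        scores[n] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_cluster_count(X_demo, range(2, 9))
best_n = max(scores, key=scores.get)  # cluster count with the highest score
print(best_n, round(scores[best_n], 4))
```

Any estimator that yields one cluster label per sample can be dropped into the loop the same way, since `silhouette_score` only needs the data and the labels.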