Simple approach to assigning clusters for new data after k-modes clustering
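I am using a k-modes model (mymodel) that was built from the data frame mydf1. I would like to assign each row of a new data frame, mydf2, to the nearest cluster of mymodel.
Similar to this question - just with k-modes instead of k-means. The predict function of the flexclust package only works with numeric data, not categorical data.
A short example: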
library(klaR)
set.seed(100)
mydf1 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mydf2 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mymodel <- klaR::kmodes(mydf1, modes = 5)
# Get mode centers
mycenters <- mymodel$modes
# Now I would want to predict which of the 5 clusters each row
# of mydf2 would be closest to, e.g.:
# cluster2 <- predict(mycenters, mydf2)
Is there already a function that can predict with a k-modes model, or what would be the simplest way to do this? Thanks!
We can assign each new row to its nearest cluster by using the same distance measure that the kmodes algorithm uses.
## From klaR::kmodes
distance <- function(mode, obj, weights) {
  # Unweighted case: simple matching distance (number of differing variables)
  if (is.null(weights))
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  dist <- sum((n_mode + n_obj) / (n_mode * n_obj))
  return(dist)
}

# kmeansObj is the fitted kmodes object; only its $modes element is used.
AssignCluster <- function(df, kmeansObj) {
  # Inner apply: one column of distances (one per mode) for each row of df.
  # Outer apply: index of the smallest distance in each column.
  apply(
    apply(df, 1, function(obj) {
      apply(kmeansObj$modes, 1, distance, obj, NULL)
    }),
    2, which.min)
}

AssignCluster(mydf2, mymodel)
[1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1
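Note that this can produce many rows that are equally distant from several clusters, in which case which.min picks the cluster with the lowest number.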
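To see how often such ties occur, one option is to build the full distance matrix and count, for each new row, how many modes share the minimum distance. The following is only a minimal base R sketch under stated assumptions: the toy `modes` and `newdata` data frames and the helper `matching_dist` are hypothetical names made up for illustration, and only the unweighted simple matching distance is covered.

```r
# Hypothetical toy data: three cluster modes and four new rows (all categorical).
modes <- data.frame(var1 = c("a", "b", "c"),
                    var2 = c("x", "y", "z"),
                    stringsAsFactors = FALSE)
newdata <- data.frame(var1 = c("a", "b", "a", "d"),
                      var2 = c("x", "y", "y", "w"),
                      stringsAsFactors = FALSE)

# Simple matching distance: number of variables on which two rows disagree.
# Returns a matrix with one row per new observation and one column per mode.
matching_dist <- function(newdata, modes) {
  new_mat  <- as.matrix(newdata)
  mode_mat <- as.matrix(modes)
  d <- sapply(seq_len(nrow(mode_mat)), function(k) {
    mode_rep <- matrix(mode_mat[k, ], nrow(new_mat), ncol(new_mat), byrow = TRUE)
    rowSums(new_mat != mode_rep)
  })
  matrix(d, nrow = nrow(new_mat))  # keep the matrix shape even for a single row
}

d <- matching_dist(newdata, modes)
nearest <- apply(d, 1, which.min)          # first minimum wins, as in AssignCluster
n_tied  <- rowSums(d == apply(d, 1, min))  # number of modes sharing the minimum
data.frame(nearest, n_tied)
# nearest: 1 2 1 1
# n_tied:  1 1 2 3
```

Rows with `n_tied` greater than 1 are the ambiguous ones. With the real model, `mymodel$modes` and `mydf2` would take the place of the toy data frames; whether to break ties at random or to leave those rows unassigned is a design choice the original answer leaves open.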