Unsupervised Classification: Assign classes to data
I have a set of data from drill holes that contains information on several geomechanical properties at 2 m intervals. I am trying to create geomechanical domains and assign each point to a domain.
I am trying to use random forest classification, but I am not sure how to relate the proximity matrix (or any of the output of the randomForest function) to labels.
My rudimentary code so far:
dh <- read.csv("gt_1_classification.csv", header = TRUE)
# replace all NA values with 0
dh[is.na(dh)] <- 0
library(randomForest)
# with no response (y) supplied, randomForest runs in unsupervised mode
dh_rf <- randomForest(dh, importance = TRUE, proximity = FALSE, ntree = 500)
I would like the classifier to decide the domains on its own.
Any help would be great!
Hack-R is right -- it is necessary to explore the data with some clustering (unsupervised learning) method first. I've provided some example code using R's built-in mtcars data as a demonstration:
# Info on the data
?mtcars
head(mtcars)
pairs(mtcars) # Scatterplot matrix of the variables
# Calculate the distance between each row (each car with its variables);
# by default this is the Euclidean distance, sqrt(sum((x_i - y_i)^2))
?dist
d <- dist(mtcars)
d # Potentially huge matrix
# Use the distance matrix for clustering
# First we'll try hierarchical clustering
?hclust
hc <- hclust(d)
hc
# Plot dendrogram of clusters
plot(hc)
# We might want to try 3 clusters;
# specify either k = number of groups or h = cut height
groups3 <- cutree(hc, k = 3)
# cutree(hc, h = 230) will give the same result
groups3
# Or we could cut into several numbers of groups at once
groupsmultiple <- cutree(hc, k = 2:5)
head(groupsmultiple)
# Draw boxes around clusters on the dendrogram
rect.hclust(hc, k = 2, border = "gray")
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 4, border = "green4")
rect.hclust(hc, k = 5, border = "darkred")
# Alternatively we can try k-means clustering
?kmeans
set.seed(1) # kmeans uses random starting centers; set a seed for reproducibility
km <- kmeans(mtcars, centers = 5)
km
# Graph based on k-means
install.packages("cluster")
require(cluster)
clusplot(mtcars,       # data frame
km$cluster,            # cluster assignments
color = TRUE,          # use color
lines = 1,             # lines connecting cluster centers
labels = 2)            # label both clusters and cases
After running this on your own data, consider which cluster definition captures the kind of similarity you are interested in. Then you can create a new variable with a level for each cluster and build a supervised model on it, as sketched below.
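As a minimal sketch of that step (not part of the original answer; the names mtcars_labeled, domain, and rf_domains are just illustrative), using the groups3 labels from the hierarchical clustering above and the randomForest package from the question:
# Attach the cluster membership as a factor label ("domain")
mtcars_labeled <- mtcars
mtcars_labeled$domain <- factor(groups3)
# Fit a supervised model to those labels, e.g. a random forest;
# proximity = TRUE also keeps the proximity matrix
library(randomForest)
set.seed(1)
rf_domains <- randomForest(domain ~ ., data = mtcars_labeled,
                           ntree = 500, importance = TRUE, proximity = TRUE)
rf_domains              # OOB confusion matrix for the cluster labels
importance(rf_domains)  # which variables drive the domain assignment
# The proximity matrix can itself be converted to a dissimilarity and clustered,
# e.g. hclust(as.dist(1 - rf_domains$proximity)), which is one way to relate
# the proximity matrix back to group labels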
Here is a decision tree example using the same mtcars data. Note that I am using mpg as the response here -- you would probably want to use your new cluster-based variable instead.
install.packages("rpart")
library(rpart)
?rpart
# grow tree
tree_mtcars <- rpart(mpg ~ ., method = "anova", data = mtcars) # "anova" is the default for a numeric response
tree_mtcars
summary(tree_mtcars) # detailed summary of splits
# Get R-squared
rsq.rpart(tree_mtcars)
?rsq.rpart
# plot tree
plot(tree_mtcars, uniform = TRUE, main = "Regression Tree for mpg ")
text(tree_mtcars, use.n = TRUE, all = TRUE, cex = .8)
Note that, while quite informative, a basic decision tree is usually not well suited for prediction. If you need predictions, you should also explore other models; one quick possibility is sketched below.
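As one example of an alternative (again only a sketch, not part of the original answer), a random forest fit to the same formula usually predicts better than a single tree, and its out-of-bag error gives a quick check:
library(randomForest)
set.seed(1)
rf_mpg <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
rf_mpg       # OOB mean of squared residuals and % variance explained
plot(rf_mpg) # OOB error as trees are added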