Spatial clustering/sampling in R
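I have a spatial data frame in R. We have a class imbalance problem, so I want to set aside the positive cases (our response variable is binary, and positive cases are roughly 10% of the dataset) and then select a portion of the negative cases to counter the class imbalance in the model. I want to select negative cases that are closely related spatially, and I'm really struggling to figure out how.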
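Some ideas I've had that might work:
- KNN to cluster the negative cases
- Overlay a spatial grid and extract x samples from each grid square
- Buffer analysis and random selection within the buffer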
If anyone has suggestions on how to do this in R, that would be great.
Thanks
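Just answering here in case anyone else searches for this.
I decided to use kmeans clustering, then added the cluster as a column to the data frame and randomly sampled from the clusters.
Code below!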
## Cluster analysis
library(ggplot2)
library(factoextra)  # provides fviz_cluster()

set.seed(1)
clusdb <- W_neg[c(
  "x_coor_farm", "y_coor_farm",
  "Area_Farm_SqM", "NatGrass_1km_buff",
  "BioFor_1km_buff", "MixedFor_1km_buff",
  "Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436",
  "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436")]

## Function to run kmeans for a given k and return the total within-cluster sum of squares
kmean_withinss <- function(k) {
  cluster <- kmeans(clusdb, k)
  return(cluster$tot.withinss)
}

# Set maximum number of clusters
max_k <- 20

# Run algorithm over a range of k
wss <- sapply(2:max_k, kmean_withinss)

# Data frame of kmeans output to find optimal k
elbow <- data.frame(k = 2:max_k, wss = wss)

# Elbow plot
ggplot(elbow, aes(x = k, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1, 20, by = 1))

# Optimal k = 8
# Re-run the model with optimal k
# Note: clusdb is not scaled here, so columns with large magnitudes dominate the distances
pc_cluster_2 <- kmeans(clusdb, 8)
pc_cluster_2$cluster
pc_cluster_2$centers
pc_cluster_2$size
pc_cluster_2$totss
pc_cluster_2$betweenss

# Percentage of total variance explained by the clustering
pc_cluster_2$betweenss / pc_cluster_2$totss * 100
# 92%

# Add cluster column to data frame
W_neg$cluster <- pc_cluster_2$cluster
W_neg <- W_neg[c(
  "TB2017", "x_coor_farm", "y_coor_farm",
  "Area_Farm_SqM", "NatGrass_1km_buff",
  "BioFor_1km_buff", "MixedFor_1km_buff",
  "Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436",
  "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436",
  "cluster")]

# Bar chart of cluster sizes
ggplot(data = W_neg, aes(y = cluster)) +
  geom_bar(aes(fill = TB2017)) +
  ggtitle("Count of Clusters by Region") +
  theme(plot.title = element_text(hjust = 0.5))

# Visualise the clusters on the first two principal components
fviz_cluster(pc_cluster_2, data = scale(clusdb), geom = c("point"), ellipse.type = "euclid")
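The code above stops once the cluster column is attached; the random sampling from each cluster mentioned in the answer isn't shown. One way it could be done in base R is sketched below. This is only an illustration: `W_neg_demo` is a made-up stand-in for `W_neg`, and `n_per_cluster` is an assumed value that you would tune to get the class balance you want.

```r
set.seed(1)

# Hypothetical stand-in for W_neg, with only the coordinate columns
W_neg_demo <- data.frame(
  x_coor_farm = runif(200),
  y_coor_farm = runif(200)
)
W_neg_demo$cluster <- kmeans(
  W_neg_demo[c("x_coor_farm", "y_coor_farm")],
  centers = 8, nstart = 10
)$cluster

# Assumed number of negatives to keep per cluster (not from the original post)
n_per_cluster <- 10

# Group row indices by cluster, then draw up to n_per_cluster rows from each.
# sample.int() on the group size avoids the sample(x, n) surprise when a
# group holds a single index.
rows_by_cluster <- split(seq_len(nrow(W_neg_demo)), W_neg_demo$cluster)
sampled_rows <- unlist(lapply(rows_by_cluster, function(idx) {
  idx[sample.int(length(idx), min(n_per_cluster, length(idx)))]
}), use.names = FALSE)

W_neg_sample <- W_neg_demo[sampled_rows, ]

# Check how many rows were drawn from each cluster
table(W_neg_sample$cluster)
```

The sampled negatives could then be combined with the positive cases (for example with `rbind()`) before fitting the model.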