Is there an approach to calculate how much groups overlap in a scatterplot, to judge whether the data can be classified with SVM models?

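To clarify the question, I use several datasets to illustrate different kinds of two-dimensional data.

The datasets can be accessed here: https://drive.google.com/file/d/14-VivVlGSlaJo6BXlYMqn-1leorSU6ET/view?usp=sharing

There is also a helper function: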

scatterplot_check <- function(data, dependent_col, x_column, y_column, legend_pos = "topright") {
  # Open a new graphics device (portable replacement for x11()).
  dev.new()
  # Treat the class column as a factor, so factor, character, and integer
  # labels (including 0 or negative values such as -1) are handled the same way.
  classes <- as.factor(data[[dependent_col]])
  # Integer codes 1..k are always valid colour indices.
  class_codes <- as.integer(classes)
  plot(data[[x_column]], data[[y_column]],
       col = class_codes, pch = 18,
       xlab = x_column, ylab = y_column)
  # One legend entry per class, coloured with the same code used in the plot.
  legend(legend_pos, legend = levels(classes),
         col = seq_along(levels(classes)), pch = 18)
}

Suppose I have read all the data into the environment:

dataset1 <- read.csv("dataset1.csv")
dataset2 <- read.csv("dataset2.csv")
dataset3 <- read.csv("dataset3.csv")

Below are some variations of the scatterplot:

scatterplot_check(dataset1, "y","x.1","x.2")

(This can probably be classified with an SVM model)

scatterplot_check(dataset2, "Purchased","Age","EstimatedSalary")

This can probably be classified with an SVM model as well

scatterplot_check(dataset3, "grades","english","math")

This might be classifiable with an SVM model

scatterplot_check(dataset3, "grades","read","math", legend_pos="topleft")

This is unlikely to be classifiable with an SVM model

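Is there a best approach to calculate how likely it is that a 2D scatterplot can be modeled with an SVM? Thanks in advance.

I am considering the following. It may turn out to have weaknesses, but I think it could serve as my custom method for calculating the overlap between groups in a scatterplot. The steps are: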

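  1. Compute the percentage of X and Y values that fall in each interval of a range sequence
  2. Define a percentage threshold (in my case I use 5%)
  3. Check the X and Y distributions that remain after the 5% filter. If all X and Y variables have the same sequence distribution in every class, the data is unlikely to be modeled with SVM, because the variables appear independent of the selected class. If, on the other hand, any X or Y variable has a different sequence distribution in each class, the data can likely be modeled with SVM, because it shows a different distribution for the selected class.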
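
The steps above can be sketched in a few lines of base R. This is only a minimal illustration and not the actual dataset_class_comparison function: the names bin_shares and classes_differ are hypothetical, and "same sequence distribution" is assumed here to mean that the same set of intervals passes the threshold in every class.

```r
# Hypothetical helper (not the author's dataset_class_comparison):
# share of values falling in each interval, keeping only the intervals
# whose share is at least `threshold`.
bin_shares <- function(values, breaks, threshold = 0.05) {
  bins <- cut(values, breaks = breaks, include.lowest = TRUE)
  shares <- prop.table(table(bins))
  shares[shares >= threshold]
}

# For each feature, collect the intervals that pass the threshold in every
# class. Returns TRUE when at least one feature has a different set of
# intervals in at least two classes (the "SVM likely" case in step 3).
classes_differ <- function(data, class_col, features, breaks, threshold = 0.05) {
  groups <- split(data, data[[class_col]])
  any(vapply(features, function(f) {
    occupied <- lapply(groups, function(g) {
      names(bin_shares(g[[f]], breaks[[f]], threshold))
    })
    length(unique(occupied)) > 1
  }, logical(1)))
}

# Synthetic data (an assumption for illustration, not one of the datasets):
# two classes shifted apart along x.1 only.
set.seed(1)
toy <- data.frame(
  y   = rep(c(-1, 1), each = 50),
  x.1 = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  x.2 = rnorm(100)
)
brk <- list(x.1 = seq(-10, 10, 1), x.2 = seq(-10, 10, 1))
classes_differ(toy, "y", c("x.1", "x.2"), brk)
```

Note that this check looks at each variable separately, so it cannot see classes that only separate when X and Y are considered together.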

Here are the results of my implementation for these 4 cases:

d1_compare <- dataset_class_comparison(dataset1, "y", "x.1", "x.2")
============================================================================
Class = -1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-2 to -1 (pct)  x.1_-1 to 0 (pct)   x.1_0 to 1 (pct)   x.1_1 to 2 (pct) 
              0.16               0.38               0.30               0.10 
x.2_-2 to -1 (pct)  x.2_-1 to 0 (pct)   x.2_0 to 1 (pct)   x.2_1 to 2 (pct) 
              0.14               0.28               0.46               0.08 
============================================================================
============================================================================
Class = 1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-1 to 0 (pct)  x.1_1 to 2 (pct)  x.1_2 to 3 (pct)  x.1_3 to 4 (pct) 
             0.08              0.42              0.36              0.08 
x.2_-1 to 0 (pct)  x.2_0 to 1 (pct)  x.2_1 to 2 (pct)  x.2_2 to 3 (pct)  x.2_3 to 4 (pct) 
             0.06              0.26              0.38              0.20              0.06 
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from x.1 or x.2
SVM Likely can be modeled


d2_compare <- dataset_class_comparison(dataset2, "Purchased", "Age", "EstimatedSalary")
============================================================================
Class = 0
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_10 to 20 (pct) Age_20 to 30 (pct) Age_30 to 40 (pct) Age_40 to 50 (pct) 
             0.066              0.325              0.413              0.178 
EstimatedSalary_10000 to 20000 (pct) EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct) 
                               0.063                                0.077                                0.059 
EstimatedSalary_40000 to 50000 (pct) EstimatedSalary_50000 to 60000 (pct) EstimatedSalary_60000 to 70000 (pct) 
                               0.098                                0.182                                0.112 
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct) 
                               0.210                                0.150 
============================================================================
============================================================================
Class = 1
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_30 to 40 (pct) Age_40 to 50 (pct) Age_50 to 60 (pct) 
             0.222              0.392              0.304 
  EstimatedSalary_20000 to 30000 (pct)   EstimatedSalary_30000 to 40000 (pct)   EstimatedSalary_40000 to 50000 (pct) 
                                 0.123                                  0.105                                  0.056 
  EstimatedSalary_70000 to 80000 (pct)   EstimatedSalary_80000 to 90000 (pct)   EstimatedSalary_90000 to 1e+05 (pct) 
                                 0.080                                  0.080                                  0.074 
 EstimatedSalary_1e+05 to 110000 (pct) EstimatedSalary_110000 to 120000 (pct) EstimatedSalary_120000 to 130000 (pct) 
                                 0.093                                  0.062                                  0.062 
EstimatedSalary_130000 to 140000 (pct) EstimatedSalary_140000 to 150000 (pct) 
                                 0.093                                  0.099 
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from Age or EstimatedSalary
SVM Likely can be modeled


d3_compare <- dataset_class_comparison(dataset3, "grades", "english", "math")
============================================================================
Class = KK-08
SeqX(0,100,10)
SeqY(100,1000,100)
 english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct) 
                 0.571                  0.162                  0.061                  0.084                  0.056 
math_600 to 700 (pct) 
                0.989 
============================================================================
============================================================================
Class = KK-06
SeqX(0,100,10)
SeqY(100,1000,100)
 english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct) 
                 0.377                  0.262                  0.098                  0.131                  0.066 
math_600 to 700 (pct) 
                0.984 
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from english and math
SVM Unlikely can be modeled



d4_compare <- dataset_class_comparison(dataset3, "grades", "math", "read")
 ============================================================================
Class = KK-08
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct) 
                0.989 
read_600 to 700 (pct) 
                0.992 
============================================================================
============================================================================
Class = KK-06
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct) 
                0.984 
read_600 to 700 (pct) 
                    1 
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from math and read
SVM Unlikely can be modeled

dataset_class_comparison is a custom function of more than 300 lines; it can be found at https://drive.google.com/file/d/1RmIhbNnKZWS2jFIsS9p4LWjhcbikpOga/view?usp=sharing