Is there an approach to quantify how much groups overlap in a scatterplot, so I can judge whether the data can be classified with an SVM model?
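To clarify the question, I use a few datasets to illustrate different kinds of 2D data.
The datasets can be accessed at: https://drive.google.com/file/d/14-VivVlGSlaJo6BXlYMqn-1leorSU6ET/view?usp=sharing
And here is a helper function: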
scatterplot_check <- function(data, dependent_col, x_column, y_column, legend_pos = "topright") {
  # Open a new graphics device (portable alternative to x11())
  dev.new()
  # Treat the class column as a factor, so factor, character, and numeric labels
  # (including 0 or negative labels such as -1) all map to colour indices 1..k
  groups <- as.factor(data[[dependent_col]])
  group_codes <- as.integer(groups)
  plot(data[[x_column]], data[[y_column]],
       col = group_codes, pch = 18,
       xlab = x_column, ylab = y_column)
  # One legend entry per class, using the same colour index as the points
  legend(legend_pos, legend = levels(groups),
         col = seq_along(levels(groups)), pch = 18)
}
Assume I have read all the data into the environment:
dataset1 <- read.csv("dataset1.csv")
dataset2 <- read.csv("dataset2.csv")
dataset3 <- read.csv("dataset3.csv")
Below are some of the scatterplot variants:
scatterplot_check(dataset1, "y","x.1","x.2")
(This one can probably be classified with an SVM model)
scatterplot_check(dataset2, "Purchased","Age","EstimatedSalary")
This one can probably be classified with an SVM model as well
scatterplot_check(dataset3, "grades","english","math")
This one cannot be classified with an SVM model
scatterplot_check(dataset3, "grades","read","math", legend_pos="topleft")
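This one is unlikely to be classifiable with an SVM model
Is there a good way to compute how likely it is that a 2D scatterplot can be modeled with an SVM? Thanks in advance.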
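I am thinking of building the following. It may turn out to have weaknesses, but this would be my custom method for measuring how much the groups overlap in a scatterplot. The steps are:
- Compute, for each class, the percentage of X and Y values that fall into each interval of a range sequence
- Define a percentage threshold (I use 5% in my case)
- Filter the X and Y distributions by the 5% threshold and compare them across classes. If all X and Y variables have the same sequence distribution in every class, the data is unlikely to be modeled well by an SVM, because the variables appear independent of the selected class. If, on the other hand, any X or Y variable has a different sequence distribution between classes, the data can likely be modeled by an SVM, because that variable's distribution differs by class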
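The steps above can be sketched roughly as follows. This is only a minimal illustration of one reading of the idea, not the actual `dataset_class_comparison` function: the helper names `binned_share` and `same_bins_across_classes`, the toy data frame, and the choice to compare which bins survive the threshold (rather than the percentages themselves) are all assumptions made for the example.

```r
# Hypothetical helper: share of one class's values that falls into each bin,
# keeping only the bins that hold at least `threshold` of that class.
# Values outside `breaks` become NA and are ignored by table().
binned_share <- function(values, breaks, threshold = 0.05) {
  bins <- cut(values, breaks = breaks, include.lowest = TRUE)
  share <- prop.table(table(bins))
  share[share >= threshold]
}

# Hypothetical helper: TRUE when every class keeps the same set of bins
# for the given variable after the threshold filter.
same_bins_across_classes <- function(data, class_col, value_col, breaks,
                                     threshold = 0.05) {
  per_class <- split(data[[value_col]], data[[class_col]])
  kept <- lapply(per_class,
                 function(v) names(binned_share(v, breaks, threshold)))
  all(vapply(kept, function(k) setequal(k, kept[[1]]), logical(1)))
}

# Toy data, made up for illustration (not one of the datasets in the question):
# x.1 occupies different bins per class, x.2 occupies the same bins in both.
toy <- data.frame(
  y   = rep(c("a", "b"), each = 10),
  x.1 = c(rep(c(0.5, 1.5), 5), rep(c(2.5, 3.5), 5)),
  x.2 = rep(c(0.5, 1.5), 10)
)
breaks <- seq(0, 4, by = 1)

x1_same <- same_bins_across_classes(toy, "y", "x.1", breaks)
x2_same <- same_bins_across_classes(toy, "y", "x.2", breaks)

# Rule from the steps: if any variable differs between classes,
# an SVM is considered plausible.
svm_plausible <- !all(x1_same, x2_same)
print(c(x1_same = x1_same, x2_same = x2_same, svm_plausible = svm_plausible))
```

Note that this looks at each axis on its own, so classes that only separate along a diagonal or a curved boundary could be reported as overlapping even though an SVM might handle them.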
Here are the results of my implementation for the 4 cases:
d1_compare <- dataset_class_comparison(dataset1, "y", "x.1", "x.2")
============================================================================
Class = -1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-2 to -1 (pct) x.1_-1 to 0 (pct) x.1_0 to 1 (pct) x.1_1 to 2 (pct)
0.16 0.38 0.30 0.10
x.2_-2 to -1 (pct) x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct)
0.14 0.28 0.46 0.08
============================================================================
============================================================================
Class = 1
SeqX(-10,10,1)
SeqY(-10,10,1)
x.1_-1 to 0 (pct) x.1_1 to 2 (pct) x.1_2 to 3 (pct) x.1_3 to 4 (pct)
0.08 0.42 0.36 0.08
x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct) x.2_2 to 3 (pct) x.2_3 to 4 (pct)
0.06 0.26 0.38 0.20 0.06
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from x.1 or x.2
SVM Likely can be modeled
d2_compare <- dataset_class_comparison(dataset2, "Purchased", "Age", "EstimatedSalary")
============================================================================
Class = 0
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_10 to 20 (pct) Age_20 to 30 (pct) Age_30 to 40 (pct) Age_40 to 50 (pct)
0.066 0.325 0.413 0.178
EstimatedSalary_10000 to 20000 (pct) EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct)
0.063 0.077 0.059
EstimatedSalary_40000 to 50000 (pct) EstimatedSalary_50000 to 60000 (pct) EstimatedSalary_60000 to 70000 (pct)
0.098 0.182 0.112
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct)
0.210 0.150
============================================================================
============================================================================
Class = 1
SeqX(10,100,10)
SeqY(10000,1e+06,10000)
Age_30 to 40 (pct) Age_40 to 50 (pct) Age_50 to 60 (pct)
0.222 0.392 0.304
EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct) EstimatedSalary_40000 to 50000 (pct)
0.123 0.105 0.056
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct) EstimatedSalary_90000 to 1e+05 (pct)
0.080 0.080 0.074
EstimatedSalary_1e+05 to 110000 (pct) EstimatedSalary_110000 to 120000 (pct) EstimatedSalary_120000 to 130000 (pct)
0.093 0.062 0.062
EstimatedSalary_130000 to 140000 (pct) EstimatedSalary_140000 to 150000 (pct)
0.093 0.099
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from Age or EstimatedSalary
SVM Likely can be modeled
d3_compare <- dataset_class_comparison(dataset3, "grades", "english", "math")
============================================================================
Class = KK-08
SeqX(0,100,10)
SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
0.571 0.162 0.061 0.084 0.056
math_600 to 700 (pct)
0.989
============================================================================
============================================================================
Class = KK-06
SeqX(0,100,10)
SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
0.377 0.262 0.098 0.131 0.066
math_600 to 700 (pct)
0.984
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from english and math
SVM Unlikely can be modeled
d4_compare <- dataset_class_comparison(dataset3, "grades", "math", "read")
============================================================================
Class = KK-08
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct)
0.989
read_600 to 700 (pct)
0.992
============================================================================
============================================================================
Class = KK-06
SeqX(100,1000,100)
SeqY(100,1000,100)
math_600 to 700 (pct)
0.984
read_600 to 700 (pct)
1
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from math and read
SVM Unlikely can be modeled
dataset_class_comparison is a custom function of more than 300 lines; it can be found at https://drive.google.com/file/d/1RmIhbNnKZWS2jFIsS9p4LWjhcbikpOga/view?usp=sharing