How to properly use K-Nearest-Neighbour?
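I have generated some data in R and applied a Bayes classifier to the points. Each point is classified as either "orange" or "blue". I can't get accurate results from the knn function, and I think this is because the classes ("blue", "orange") are not linked to knn correctly.
My training data is in a data frame (x, y). My classes are in a separate array. I did it this way for the Bayes classifier because it is easier to plot. However, now I don't know how to "plug in" my classes to knn. The following code gives very inaccurate results. I have changed k to many different test values, and all of them are inaccurate.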
library(class)
x <- round(runif(100, 1, 100))
y <- round(runif(100, 1, 100))
train.df <- data.frame(x, y)
x.test <- round(runif(100, 1, 100))
y.test <- round(runif(100, 1, 100))
test.df <- data.frame(x.test, y.test)
cl <- factor(c(rep("blue", 50), rep("orange", 50)))
k <- knn(train.df, test.df, cl, k=100)
Again, my sorted classes are in the array classes, further up in the code.
Here is my full document. The code above is at the very bottom.
library(class)
n <- 100
x <- round(runif(n, 1, n))
y <- round(runif(n, 1, n))
# ============================================================
# Bayes Classifier + Decision Boundary Code
# ============================================================
classes <- "null"
colours <- "null"
for (i in 1:n)
{
# P(C = j | X = x, Y = y) = prob
# "The probability that the class (C) is orange (j) when X is some x, and Y is some y"
# Two predictors that influence classification: x, y
# If x and y are both under 50, there is a 95% chance of being orange (grouping)
# If x and y are both over 50, or if one of them is over 50, grouping is blue
# Algorithm favours whichever grouping has a higher chance of success, then plots using that colour
# When prob (from above) is 50%, the boundary is drawn
percentChance <- 0
if (x[i] < 50 && y[i] < 50)
{
# 95% chance of orange and 5% chance of blue
# Bayes Decision Boundary therefore assigns to orange when x < 50 and y < 50
# "colours" is the Decision Boundary grouping, not the plotted grouping
percentChance <- 95
colours[i] <- "orange"
}
else
{
percentChance <- 10
colours[i] <- "blue"
}
if (round(runif(1, 1, 100)) > percentChance)
{
classes[i] <- "blue"
}
else
{
classes[i] <- "orange"
}
}
boundary.x <- seq(0, 100, by=1)
boundary.y <- 0
for (i in 1:101)
{
if (i > 49)
{
boundary.y[i] <- -10 # just for the sake of visual consistency, real value is 0
}
else
{
boundary.y[i] <- 50
}
}
df <- data.frame(boundary.x, boundary.y)
plot(x, y, col=classes)
lines(df, type="l", lty=2, lwd=2, col="red")
# ============================================================
# K-Nearest neighbour code
# ============================================================
#library(class)
#x <- round(runif(100, 1, 100))
#y <- round(runif(100, 1, 100))
train.df <- data.frame(x, y)
x.test <- round(runif(n, 1, n))
y.test <- round(runif(n, 1, n))
test.df <- data.frame(x.test, y.test)
cl <- factor(c(rep("blue", 50), rep("orange", 50)))
k <- knn(train.df, test.df, cl, k=(round(sqrt(n))))
Thanks for your help.
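First, for reproducibility, you should set a seed before generating a set of random numbers, as with runif, or before running any simulation or stochastic ML algorithm. Note that in the code below, we set the same seed for every instance that generates x and a different seed for every instance that generates y. That way, the pseudo-randomly generated x is always the same (but different from y), and likewise for y.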
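As a minimal illustration of the seeding idea (the variable names a and b are made up for this sketch and do not appear in the original code): re-seeding with the same value restarts the pseudo-random stream, so the same call returns the same numbers.

```r
# Same seed, same draws: set.seed() restarts the pseudo-random stream.
set.seed(1)
a <- round(runif(5, 1, 100))

set.seed(1)
b <- round(runif(5, 1, 100))

print(identical(a, b))  # TRUE: both vectors come from the same restarted stream
```

One consequence worth keeping in mind: any two vectors generated by the same call right after the same seed are identical, which matters if one of them is meant to be independent of the other.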
library(class)
n <- 100
set.seed(1)
x <- round(runif(n, 1, n))
set.seed(2)
y <- round(runif(n, 1, n))
# ============================================================
# Bayes Classifier + Decision Boundary Code
# ============================================================
classes <- "null"
colours <- "null"
for (i in 1:n)
{
# P(C = j | X = x, Y = y) = prob
# "The probability that the class (C) is orange (j) when X is some x, and Y is some y"
# Two predictors that influence classification: x, y
# If x and y are both under 50, there is a 95% chance of being orange (grouping)
# If x and y are both over 50, or if one of them is over 50, grouping is blue
# Algorithm favours whichever grouping has a higher chance of success, then plots using that colour
# When prob (from above) is 50%, the boundary is drawn
percentChance <- 0
if (x[i] < 50 && y[i] < 50)
{
# 95% chance of orange and 5% chance of blue
# Bayes Decision Boundary therefore assigns to orange when x < 50 and y < 50
# "colours" is the Decision Boundary grouping, not the plotted grouping
percentChance <- 95
colours[i] <- "orange"
}
else
{
percentChance <- 10
colours[i] <- "blue"
}
if (round(runif(1, 1, 100)) > percentChance)
{
classes[i] <- "blue"
}
else
{
classes[i] <- "orange"
}
}
boundary.x <- seq(0, 100, by=1)
boundary.y <- 0
for (i in 1:101)
{
if (i > 49)
{
boundary.y[i] <- -10 # just for the sake of visual consistency, real value is 0
}
else
{
boundary.y[i] <- 50
}
}
df <- data.frame(boundary.x, boundary.y)
plot(x, y, col=classes)
lines(df, type="l", lty=2, lwd=2, col="red")
# ============================================================
# K-Nearest neighbour code
# ============================================================
#library(class)
set.seed(1)
x <- round(runif(n, 1, n))
set.seed(2)
y <- round(runif(n, 1, n))
train.df <- data.frame(x, y)
set.seed(1)
x.test <- round(runif(n, 1, n))
set.seed(2)
y.test <- round(runif(n, 1, n))
test.df <- data.frame(x.test, y.test)
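I think the main problem is here. I think you want to pass the class labels obtained from the Bayes classifier to knn, i.e. the vector classes. Instead, you are passing cl, which just labels the first 50 cases "blue" and the last 50 "orange" by position, so it says nothing about where the points actually lie and is not meaningful.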
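#cl <- factor(c(rep("blue", 50), rep("orange", 50)))
# Pass the Bayes labels (one per training row) as the class vector
k <- knn(train.df, test.df, classes, k=25)
# knn returns a factor; convert to character so points are drawn in blue/orange
plot(test.df$x.test, test.df$y.test, col=as.character(k))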
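To check whether knn is actually "accurate" once it receives meaningful labels, one option is to compare its predictions with labels generated by the same rule on a separate test set. The sketch below is an illustration under assumptions, not the original poster's code: make_data is a hypothetical helper, the labelling is a vectorised simplification of the Bayes loop above (95% orange when x < 50 and y < 50, otherwise 10% orange), and the seeds and k values are arbitrary. The test set uses a different seed on purpose, because reusing the training seeds would reproduce the training points exactly.

```r
library(class)

n <- 100

# Hypothetical helper: draws n points and labels them with a simplified
# version of the rule used in the Bayes code above.
make_data <- function(n, seed) {
  set.seed(seed)
  x <- round(runif(n, 1, 100))
  y <- round(runif(n, 1, 100))
  p_orange <- ifelse(x < 50 & y < 50, 0.95, 0.10)
  cls <- ifelse(runif(n) < p_orange, "orange", "blue")
  list(df = data.frame(x, y),
       cls = factor(cls, levels = c("blue", "orange")))
}

train <- make_data(n, seed = 1)
test  <- make_data(n, seed = 2)  # different seed, so test points differ from train

# cl must hold one label per training row, in the same order as the rows.
pred <- knn(train$df, test$df, cl = train$cls, k = 10)

# Confusion table and share of test points whose prediction matches the label.
print(table(predicted = pred, actual = test$cls))
accuracy <- mean(pred == test$cls)
print(accuracy)

# Leave-one-out accuracy on the training set for a few candidate k values.
for (kk in c(1, 5, 10, 25)) {
  cv_pred <- knn.cv(train$df, cl = train$cls, k = kk)
  cat("k =", kk, "LOO accuracy =", mean(cv_pred == train$cls), "\n")
}
```

Because the labels themselves are noisy (a point in the orange region is still blue about 5% of the time), even a well-chosen k cannot be expected to reach 100% agreement with the test labels.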