NumPy implementation of the Nearest Neighbor classification algorithm classifies everything the exact same way
My assignment is to use the K-Nearest Neighbors algorithm with NumPy to determine which kind of flower something is, based on various features of it (e.g. stem length, petal length, etc.). (For the record, I have used Python in the past, although it's not my "best" language; however, I'm completely new to NumPy.)
My training and testing data are both in CSVs that look like this:
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
I know how to do the basic algorithm. Here is the C# I created for it:
using System;
using System.Collections.Generic;
using Microsoft.VisualBasic.FileIO; // TextFieldParser (requires a reference to Microsoft.VisualBasic)

namespace Project_3_Prototype
{
    public class FourD
    {
        public double f1, f2, f3, f4;
        public string name;

        public static double Distance(FourD a, FourD b)
        {
            double squared = Math.Pow(a.f1 - b.f1, 2) + Math.Pow(a.f2 - b.f2, 2)
                           + Math.Pow(a.f3 - b.f3, 2) + Math.Pow(a.f4 - b.f4, 2);
            return Math.Sqrt(squared);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Load the training rows into memory
            List<FourD> distances = new List<FourD>();
            using (var parser = new TextFieldParser("iris-training-data.csv"))
            {
                parser.SetDelimiters(",");
                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };
                    distances.Add(curr);
                }
            }

            double correct = 0, total = 0;
            using (var parser = new TextFieldParser("iris-testing-data.csv"))
            {
                parser.SetDelimiters(",");
                int i = 1;
                while (!parser.EndOfData)
                {
                    total++;
                    string[] fields = parser.ReadFields();
                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };

                    // Find the nearest training example by Euclidean distance
                    FourD min = distances[0];
                    foreach (FourD comp in distances)
                    {
                        if (FourD.Distance(comp, curr) < FourD.Distance(min, curr))
                        {
                            min = comp;
                        }
                    }

                    if (min.name == curr.name)
                    {
                        correct++;
                    }

                    Console.WriteLine(string.Format("{0},{1},{2}", i, curr.name, min.name));
                    i++;
                }
            }

            Console.WriteLine("Accuracy: " + correct / total);
            Console.ReadLine();
        }
    }
}
This works exactly as expected, with output like the following:
# The format is Number,Correct label,Predicted Label
1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
5,Iris-setosa,Iris-setosa
6,Iris-setosa,Iris-setosa
7,Iris-setosa,Iris-setosa
8,Iris-setosa,Iris-setosa
9,Iris-setosa,Iris-setosa
10,Iris-setosa,Iris-setosa
11,Iris-setosa,Iris-setosa
12,Iris-setosa,Iris-setosa
...
Accuracy: 0.946666666666667
I'm trying to do the same thing in NumPy. However, the assignment doesn't allow me to use for loops, only vectorized functions.
So, basically, what I want to do is: for each row in the testing data, get the index of the row in the training data that is closest to it (i.e. the one with the minimum Euclidean distance).
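For illustration, that lookup can be written with broadcasting alone; the following is a minimal sketch with made-up arrays (the names train and test are placeholders, not variables from the code below):

import numpy as np

# Two hypothetical training rows and two hypothetical test rows (4 features each)
train = np.array([[4.6, 3.6, 1.0, 0.2],
                  [7.0, 3.2, 4.7, 1.4]])
test = np.array([[4.8, 3.4, 1.9, 0.2],
                 [6.9, 3.1, 4.9, 1.5]])

# (m, 1, 4) minus (1, n, 4) broadcasts to (m, n, 4): every test row minus every training row
diffs = test[:, np.newaxis, :] - train[np.newaxis, :, :]

# Squared Euclidean distances, shape (m, n); argmin along axis 1 gives,
# for each test row, the index of the nearest training row
nearest = np.argmin((diffs ** 2).sum(axis=2), axis=1)
print(nearest)  # [0 1]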
Here is what I tried in Python:
import numpy as np

def main():
    # Split each line of the CSV into a list of attributes and labels
    data = [x.split(',') for x in open("iris-training-data.csv")]
    # The last item is the label
    labels = np.array([x[-1].rstrip() for x in data])
    # Convert the first 3 items to a 2D array of floats
    floats = np.array([x[0:3] for x in data]).astype(float)
    classifyTrainingExamples(labels, floats)

def classifyTrainingExamples(labels, floats):
    # We're basically doing the same thing to the testing data that we did to the training data
    testingData = [x.split(',') for x in open("iris-testing-data.csv")]
    testingLabels = np.array([x[-1].rstrip() for x in testingData])
    testingFloats = np.array([x[0:3] for x in testingData]).astype(float)

    res = np.apply_along_axis(lambda x: closest(floats, x), 1, testingFloats)

    correct = 0
    for number, index in enumerate(res):
        if labels[index] == testingLabels[number]:
            correct += 1
            print("{},{},{}".format(number + 1, testingLabels[number], labels[index]))
        number += 1
    print(correct / len(list(res)))

def closest(otherArray, item):
    res = np.apply_along_axis(lambda x: distance(x, item), 1, otherArray)
    i = np.argmin(res)
    return i

# Get the Euclidean distance between two "flat" lists (i.e. one particular row)
def distance(a, b):
    # Subtract one from the other elementwise, then raise each one to the power of 2
    lst = (a - b) ** 2
    # Sum all of the elements together, and take the square root
    result = np.sqrt(lst.sum())
    return result

main()
Unfortunately, the output looks like this:
1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
....
74,Iris-setosa,Iris-setosa
75,Iris-setosa,Iris-setosa
0.93333333
Every row has only the Iris-setosa label, and the accuracy comes out as 0.9333333.
I tried stepping through with a debugger, and every item gets evaluated as correct by the if statement (yet the accuracy still shows as 0.93333333).
So, basically:
- Every result is displayed as "correct" (which it clearly isn't).
- Every value is displayed as Iris-setosa.
- My percentage shows as 93%. The correct value is actually about 94%, but I would expect this to show 100%, since every result is supposedly "correct."
Can someone help me see what I'm missing here?
Before anyone asks, for the record: yes, I did try stepping through this with a debugger :) And also for the record: yes, this is homework.
If you really want to do it in one line, here is what you can do (I loaded the dataset from scikit-learn):
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split training and test set
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
# 1-nearest neighbour
ypred = np.array([ytrain[np.argmin(np.sum((x - Xtrain)**2, axis=1))] for x in Xtest])
# Compute classification error
sum(ypred != ytest) / len(ytest)
Now, this is 1-nearest neighbour, which only looks at the single closest point in the training set. For k-nearest neighbours, you would change it to this:
# k-nearest neighbour
k = 3
ypredk = np.array([np.argmax(np.bincount(ytrain[np.argsort(np.sum((x - Xtrain)**2, axis=1))[0:k]])) for x in Xtest])
sum(ypredk != ytest) / len(ytest)
In other words, you sort the distances and take the indices of the k smallest ones (that's the np.argsort part) along with their corresponding labels, and then you look for the most common label among those k (that's the np.argmax(np.bincount(x)) part).
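To make those two pieces concrete, here is a tiny worked example with made-up distances and labels:

import numpy as np

d = np.array([0.9, 0.1, 0.5, 0.3])    # distances from one test point to 4 training points
labels = np.array([2, 0, 1, 0])        # integer class labels of those training points
k = 3

nearest = np.argsort(d)[0:k]           # [1, 3, 2]: indices of the 3 smallest distances
counts = np.bincount(labels[nearest])  # labels [0, 0, 1] -> counts [2, 1]
print(np.argmax(counts))               # 0, the most common label among the k neighbours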
Finally, if you want to be sure, you can compare against scikit-learn:
# scikit-learn NN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=k, algorithm='ball_tree')
knn.fit(Xtrain, ytrain)
ypred_sklearn = knn.predict(Xtest)
sum(ypred_sklearn != ytest) / len(ytest)
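And since the assignment in the question forbids Python-level for loops, the list comprehensions above can be replaced with broadcasting as well. A sketch under that constraint, reusing Xtrain, Xtest, ytrain, and k from the snippets above:

# Squared Euclidean distances between every test row and every training row, shape (m, n)
d2 = ((Xtest[:, np.newaxis, :] - Xtrain[np.newaxis, :, :]) ** 2).sum(axis=2)

# 1-NN: label of the nearest training row for each test row
ypred_vec = ytrain[np.argmin(d2, axis=1)]

# k-NN: labels of the k nearest rows, then a majority vote per row
votes = ytrain[np.argsort(d2, axis=1)[:, :k]]
ypredk_vec = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 1, votes)

The full (m, n) distance matrix costs more memory than the per-row versions, which is a fair trade-off at iris scale.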