How can I solve for predicted probabilities using a KNN classifier?
I am using a KNN classifier on a dataset and am trying to find the predicted probability for each prediction, but I am not sure how to do this. I have not found much information on the topic. The code I am using is:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Reading the data and removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other" #consolidating categorical data in the race column
Adult.loc[:,"race"].value_counts().plot(kind='bar') #plotting the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int) #One hot encoding the catagorical/creating new categorical data in the race column
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
df2 = pd.DataFrame() #creating a dataframe to plot the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = NormZ1
df2.loc[:,1] = MinMaxage #assigning new columns for df2
df2.loc[:,2] = Adult.loc[:,"hoursperweek"]
df2.columns = ["White","MinMaxage","hoursperweek"] #labeling the columns for df2
df2.head() #checking the new dataframe
X = np.array(df2.drop(["hoursperweek"], axis=1)) #choosing the expert label to predict and excluding it from the X array
y = np.array(df2["hoursperweek"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) #splitting the data into training and testing sets
clf = neighbors.KNeighborsClassifier() #assigning K neighbors classifier
clf.fit(X_train, y_train) #fitting the data for X_train and y_train
accuracy = clf.score(X_test, y_test) #finding the accuracy of the prediction
print("accuracy rate with age MinMax Normilized")
print(accuracy)
print ('predictions for test set with age MinMax Normalized:') #showing results
print(clf.predict(X_test))
print ('actual class values with age MinMax Normalized:')
print(y_test)
I have loaded the actual and predicted results into a new dataframe and would like to add a third column to that dataframe holding the predicted probability for each row, but I am not sure how to do this in Python. Is there a way to solve for the predicted probability of each outcome? I would like to use the predicted probabilities for a confusion matrix and an ROC curve.
You can try clf.predict_proba(X_test) to get the predicted probabilities. source
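For each test row, predict_proba returns one probability per class (columns ordered like clf.classes_), so the probability of the predicted class is just the row maximum. A minimal sketch of adding that as a third column and building the confusion matrix, assuming clf, X_test and y_test are exactly as in the question (the results_df name is only illustrative):
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
proba = clf.predict_proba(X_test)               # shape (n_test_rows, n_classes)
y_pred = clf.predict(X_test)                    # hard class predictions
results_df = pd.DataFrame({"actual": y_test,    # hypothetical results dataframe
                           "predicted": y_pred,
                           "predicted_probability": proba.max(axis=1)})  # probability of the predicted class
print(results_df.head())
print(confusion_matrix(y_test, y_pred))         # the confusion matrix is built from the hard predictions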
From the question code above, I removed
df2.loc[:,1] = NormZ1
and re-ran the code. Using the syntax
print(clf.predict_proba(X_test))
I was able to get probabilities with shape (6513, 93).
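If you want to see which class each of the 93 probability columns refers to, the columns line up with clf.classes_ (one column per distinct label seen during fit). A quick check, assuming the same fitted clf:
proba = clf.predict_proba(X_test)
print(proba.shape)          # (n_test_rows, n_classes), e.g. (6513, 93) here
print(clf.classes_)         # the class label behind each probability column
print(len(clf.classes_))    # should equal the number of probability columns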