How can I do classification with a float data type after normalization?
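I am working on a dataset labeled Adult, and I am trying to run KNN on a few columns that I put into a new data frame after normalizing some of them. I get the error ValueError: Unknown label type: 'continuous' when trying to run
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
After researching the error online, it seems that I need to use a label encoder on the data after normalizing, since it is now a float rather than an int, but I am having trouble using the label encoder. The code I am using is: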
import numpy as np  # Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn import neighbors  # provides the neighbors.KNeighborsClassifier name used below
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  # Read the data from a freely available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # skipinitialspace=True strips the extra spaces in the columns
# Assign reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other"  # Consolidate categorical data in the race column
Adult.loc[:, "race"].value_counts().plot(kind='bar')  # Plot the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int)  # One-hot encode the categorical race column into new indicator columns
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
Minhours = min(Adult.loc[:,"hoursperweek"])  # MinMax normalizing the hoursperweek column
Maxhours = max(Adult.loc[:,"hoursperweek"])
MinMaxhours = (Adult.loc[:,"hoursperweek"] - Minhours)/(Maxhours - Minhours)
df2 = pd.DataFrame()  # Create a dataframe to hold the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = MinMaxage
df2.loc[:,2] = MinMaxhours
df2.columns = ["White","MinMaxage","MinMaxhours"]
X = np.array(df2.drop(['MinMaxhours'], axis=1))  # axis must be passed by keyword in recent pandas versions
y = np.array(df2['MinMaxhours'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
clf.predict(X_test)
y_test
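Can someone help me figure out how to label encode the data so that I can perform KNN on it? I have looked it up on the sklearn website and in various examples, but I still have trouble applying it to my dataset. I get the error when running clf.fit(X_train, y_train) to fit the data.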
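It looks like you have a regression problem rather than a classification problem. You are trying to predict the MinMaxhours variable, which is a real number. If you are trying to predict a real number, you should use the regression version of the nearest neighbors algorithm. To get predictions, the following code should work.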
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
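To see why the classifier rejects the target, and what the two ways out look like side by side, here is a minimal sketch. It does not use the Adult data: the arrays are synthetic stand-ins, and the names rng, bins, y_train_cls and y_test_cls as well as the cut points 1/3 and 2/3 are assumptions made only for illustration. It checks how scikit-learn categorizes a float target with type_of_target, fits the regressor, and, as an alternative if discrete classes are really what you want, bins the target with np.digitize before fitting the classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in for the question's data (hypothetical values):
# two features in [0, 1] and a float-valued target in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 0.5 * X[:, 0] + 0.5 * X[:, 1]

# scikit-learn treats a non-integer float target as continuous,
# which is why KNeighborsClassifier refuses to fit on it.
print(type_of_target(y))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Option 1: keep the target continuous and use the regressor.
reg = KNeighborsRegressor()
reg.fit(X_train, y_train)
r2 = reg.score(X_test, y_test)  # for regressors, score() returns R^2
print("R^2:", r2)

# Option 2: turn the target into discrete classes first.
# The cut points are arbitrary and chosen only for illustration.
bins = np.array([1 / 3, 2 / 3])
y_train_cls = np.digitize(y_train, bins)  # integer class labels 0, 1, 2
y_test_cls = np.digitize(y_test, bins)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train_cls)
acc = clf.score(X_test, y_test_cls)  # for classifiers, score() returns accuracy
print("accuracy:", acc)
```

Which option fits depends on what you want to predict. If the real goal is classification on this dataset, another possibility is to use an existing categorical column as y (for example the income column, named less50kmoreeq50kn in the question); scikit-learn classifiers accept string labels directly, so no label encoding of the target would be needed in that case.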