sklearn Naive Bayes in Python
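I have trained a classifier on the 'Rocks and Mines' dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data).
When I compute the accuracy score, it always seems to be perfectly accurate (the output is 1.0), which I find hard to believe. Am I making a mistake somewhere, or is Naive Bayes really this powerful?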
import urllib.request

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
data = urllib.request.urlopen(url)
# the file has no header row; header=None keeps the first sample as data
df = pd.read_csv(data, header=None)

# replace R and M with 0 and 1
m = len(df.iloc[:, -1])
Y = df.iloc[:, -1].values
y_val = []
for i in range(m):
    if Y[i] == 'M':
        y_val.append(1)
    else:
        y_val.append(0)

df = df.drop(df.columns[-1], axis=1)  # dropping column containing 'R', 'M'
X = df.values

# initializing the classifier
clf = GaussianNB()

# splitting the data
train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size=0.33, random_state=42)

# training the classifier
clf.fit(train_x, train_y)
pred = clf.predict(test_x)  # making a prediction

# accuracy_score expects (y_true, y_pred)
score = accuracy_score(test_y, pred)

# printing the accuracy score
print(score)
X is the input and y_val is the output (I have converted 'R' and 'M' to 0 and 1).
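This happens because of the random_state parameter passed to train_test_split().
When random_state is set to an integer, sklearn guarantees that the data is sampled the same way on every run.
So each time you run the code with that random_state, you get the same split and therefore the same result, which is the expected behavior.
See the docs for more details.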
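To make the effect of random_state concrete, here is a minimal sketch. It does not use the sonar file: the data is a synthetic stand-in from make_classification (an assumption, sized like the sonar set with 208 rows and 60 features), and split_and_score is a hypothetical helper introduced only for this illustration. It shows that a fixed integer seed reproduces the same test rows and the same score, while different seeds select different rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the sonar data (assumption: 208 rows, 60 features).
X, y = make_classification(n_samples=208, n_features=60, random_state=0)


def split_and_score(seed):
    """Split row indices with the given seed, fit GaussianNB, return (test indices, accuracy)."""
    idx = np.arange(len(y))
    train_idx, test_idx = train_test_split(idx, test_size=0.33, random_state=seed)
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
    return test_idx, score


# Same integer seed: identical test rows, hence an identical score.
idx_a, score_a = split_and_score(42)
idx_b, score_b = split_and_score(42)
same_split = np.array_equal(idx_a, idx_b)
print(same_split, score_a == score_b)

# Different seeds: different test rows, so the score is free to vary.
scores = {seed: split_and_score(seed)[1] for seed in range(5)}
print(scores)
```

Looking at the score across several seeds, rather than a single fixed one, is a quick way to judge whether one number is representative of the classifier.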