The cost function in logistic regression is giving nan values
I am implementing logistic regression from scratch in Python using gradient descent, working on the breast cancer dataset. When computing the cost, I get only nan values. I tried standardizing my data and lowering my alpha value, but nothing helped. Despite this, I still get 95.8% accuracy, which feels wrong. Part of my code is given below:
def hypothesis(b, X):
    z = np.dot(X, b)
    #print(z)
    return sigmoid(z)

def sigmoid(z):
    return 1 / (1 + np.exp(-1 * z))

def FindCost(h, y):
    r = y.shape[0]
    cost = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / r
    #print(cost)
    return cost * -1

def gradient_descent(X, y, alpha, epoch):
    r = X.shape[0]
    c = X.shape[1]
    theta = np.ones((c, 1))
    min_cost = None
    min_theta = []
    Cost_list = []
    for i in range(epoch):
        h = hypothesis(theta, X)
        grad = np.dot(X.T, (h - y))
        theta = theta - alpha * grad
        cost = FindCost(h, y)
        Cost_list.append(cost)
        if min_cost is None or min_cost > cost:
            min_cost = cost
            min_theta = list(theta)
    return min_theta, Cost_list

def calAccuracy(theta, X, y):
    h = hypothesis(theta, X)
    correct = 0
    for i in range(y.shape[0]):
        if h[i] >= 0.5:
            if y[i] == 1: correct += 1
            print("predicted: ", 1, end='\t\t')
        elif h[i] < 0.5:
            if y[i] == 0: correct += 1
            print("predicted: ", 0, end='\t\t')
        print("actual: ", y[i])
    return correct * 100 / y.shape[0]

alpha = 0.01
epoch = 1000
theta, cost = gradient_descent(x_train, y_train, alpha, epoch)
accuracy = calAccuracy(theta, x_test, y_test)
print(f"the accuracy of the model: {accuracy} %")
Output:
the accuracy of the model: 95.8041958041958 %
Dataset standardization:

for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i] - np.mean(x_train[i])) / np.std(x_train[i])
    x_test[i] = (x_test[i] - np.mean(x_test[i])) / np.std(x_test[i])
The dataset looks like this: [image of the dataset]
After standardizing my dataset, I concatenated a column of ones (the first column) to my x_train and x_test.
x train data:
[[ 1.00000000e+00 -6.51198873e-01 -5.29762615e-01 4.19236602e-02
1.91948751e+00 -7.80449683e-01]
[ 1.00000000e+00 -6.85821055e-01 -4.00751146e-01 -3.29500919e-02
1.92747771e+00 -8.07955419e-01]
[ 1.00000000e+00 -6.76114725e-01 -4.04963490e-01 -6.16161982e-02
1.93556890e+00 -7.92874483e-01]
y train data:
[[1.]
[1.]
[1.]
x test data:
[[ 1.00000000e+00 -5.63066669e-01 -5.36144255e-01 -2.71074811e-01
1.98575469e+00 -6.15468953e-01]
[ 1.00000000e+00 -5.57037602e-01 -5.57366708e-01 -2.60414280e-01
1.98474749e+00 -6.09928901e-01]
[ 1.00000000e+00 -5.56192661e-01 -5.57892986e-01 -2.62657675e-01
1.98504143e+00 -6.08298112e-01]
y test data:
[[0.]
[1.]
[0.]
What am I doing wrong here, and how do I stop the nan values from appearing?
Also, if my cost function is giving nan values, how am I getting such high accuracy?
P.S. My dataset had no null values to begin with, and I converted it to a numpy array.
So the problem is that:

for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i] - np.mean(x_train[i])) / np.std(x_train[i])
    x_test[i] = (x_test[i] - np.mean(x_test[i])) / np.std(x_test[i])

does not standardize per feature (per column) — `x_train[i]` indexes a row, not a column. You should use this instead:

for i in range(x_train.shape[1]):
    x_train[:, i] = (x_train[:, i] - np.mean(x_train[:, i])) / np.std(x_train[:, i])
    x_test[:, i] = (x_test[:, i] - np.mean(x_test[:, i])) / np.std(x_test[:, i])

This gives proper cost values, with an accuracy of: 95.8041958041958 %
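As a side note, the per-column loop can be replaced with a single vectorized expression, and the nan cost can be guarded against even when a prediction saturates to exactly 0 or 1, by clipping before taking the log. A minimal sketch (the helper names `standardize_columns` and `stable_cost` are mine, not from the original post):

```python
import numpy as np

def standardize_columns(X):
    # Column-wise standardization: subtract each feature's mean and
    # divide by its standard deviation, broadcasting over the rows.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def stable_cost(h, y, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so np.log never
    # returns -inf and the cross-entropy cost never becomes nan.
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```

Also note that, as with scikit-learn's StandardScaler, the test set is usually scaled with the training set's mean and std rather than its own, to avoid leaking test statistics into preprocessing.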