The cost function in logistic regression is giving nan values

I am implementing logistic regression from scratch in Python using gradient descent, working on the breast cancer dataset. When computing the cost, I only get nan values. I tried standardizing my data and lowering my alpha value, but neither helped. Despite this, I still get 95.8% accuracy, which feels wrong. Part of my code is given below:

def hypothesis(b,X):
    z= np.dot(X,b)
    #print(z)
    return sigmoid(z)

def sigmoid(z):
    return 1/(1+np.exp(-1*z))

def FindCost(h,y):
    r = y.shape[0]
    cost = np.sum(y*np.log(h)+(1-y)*np.log(1-h))/r
    #print(cost)
    return cost*-1
    
def gradient_descent(X,y,alpha,epoch):
    r = X.shape[0]
    c = X.shape[1]
    theta = np.ones((c,1))
    min_cost=None
    min_theta=[]
    Cost_list=[]
    for i in range(epoch):
        h = hypothesis(theta,X)
        grad = np.dot(X.T,(h-y))
        theta = theta - alpha*grad
        cost = FindCost(h,y)
        Cost_list.append(cost)
        if min_cost is None or min_cost>cost:
            min_cost=cost
            min_theta=list(theta)
    return min_theta,Cost_list   

def calAccuracy(theta,X,y):
    h = hypothesis(theta,X)
    correct=0
    for i in range(y.shape[0]):
        if h[i]>=0.5:
            if y[i]==1: correct+=1
            print("predicted: ",1,end='\t\t')
        elif h[i]<0.5:
            if y[i]==0: correct+=1
            print("predicted: ",0,end='\t\t')
        print("actual: ",y[i])
    return correct*100/y.shape[0]


alpha = 0.01
epoch = 1000
theta,cost = gradient_descent(x_train,y_train,alpha,epoch)
accuracy = calAccuracy(theta,x_test,y_test)
print(f"the accuracy of the model: {accuracy} %")

Output:
the accuracy of the model: 95.8041958041958 %
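For what it's worth, a common guard against nan in this kind of cost is clipping h away from exactly 0 and 1 before taking logs. A minimal sketch (the `eps` value and the function name `find_cost_safe` are my own, not from the code above):

```python
import numpy as np

def find_cost_safe(h, y, eps=1e-15):
    # Clip h into (eps, 1-eps) so np.log never sees 0 and the cost stays finite.
    h = np.clip(h, eps, 1 - eps)
    r = y.shape[0]
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / r

# Saturated predictions like these would make the original FindCost return nan:
h = np.array([[1.0], [0.0], [0.9]])
y = np.array([[1.0], [0.0], [1.0]])
print(find_cost_safe(h, y))   # finite, no nan
```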

Dataset standardization:

for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])

The dataset looks like this: (image)

After standardizing my dataset, I concatenated a column of ones (the first column) to my x_train and x_test.

x train data:
 [[ 1.00000000e+00 -6.51198873e-01 -5.29762615e-01  4.19236602e-02
   1.91948751e+00 -7.80449683e-01]
 [ 1.00000000e+00 -6.85821055e-01 -4.00751146e-01 -3.29500919e-02
   1.92747771e+00 -8.07955419e-01]
 [ 1.00000000e+00 -6.76114725e-01 -4.04963490e-01 -6.16161982e-02
   1.93556890e+00 -7.92874483e-01]

y train data:
[[1.]
 [1.]
 [1.]
    
x test data:
 [[ 1.00000000e+00 -5.63066669e-01 -5.36144255e-01 -2.71074811e-01
   1.98575469e+00 -6.15468953e-01]
 [ 1.00000000e+00 -5.57037602e-01 -5.57366708e-01 -2.60414280e-01
   1.98474749e+00 -6.09928901e-01]
 [ 1.00000000e+00 -5.56192661e-01 -5.57892986e-01 -2.62657675e-01
   1.98504143e+00 -6.08298112e-01]

y test data:
 [[0.]
 [1.]
 [0.]

What am I doing wrong here, and how do I stop the nan values from appearing? Also, if my cost function is producing nan values, how am I getting such high accuracy?

P.S. My dataset had no null values to begin with, and I converted the dataset to a numpy array.
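A quick experiment (with made-up z values, not the real data) shows how the nan cost and the high accuracy are compatible: large |z| saturates the sigmoid to exactly 1.0 in float64, which breaks the log terms but not the 0.5 threshold:

```python
import numpy as np

# Made-up z values, not the real dataset: large |z| saturates the sigmoid.
z = np.array([50.0, -50.0, 40.0])
h = 1 / (1 + np.exp(-z))          # h hits exactly 1.0 for large positive z
y = np.array([1.0, 0.0, 1.0])

# log(1 - 1.0) = -inf and 0 * (-inf) = nan, so the summed cost is nan...
cost = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / y.shape[0]
print(cost)                        # nan

# ...yet thresholding h at 0.5 still classifies every example correctly:
print(np.mean((h >= 0.5) == (y == 1)))   # 1.0
```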

So the problem is this:

for i in range(x_train.shape[1]):
    x_train[i] = (x_train[i]-np.mean(x_train[i]))/np.std(x_train[i])
    x_test[i] = (x_test[i]-np.mean(x_test[i]))/np.std(x_test[i])

This is not standardizing per feature (per column): `x_train[i]` indexes *rows*, so only the first `x_train.shape[1]` rows get rescaled while the features keep their original, widely varying scales; the sigmoid then saturates, `log` sees 0, and the cost becomes nan. Standardize each column instead:

for i in range(x_train.shape[1]):
    x_train[:,i] = (x_train[:,i]-np.mean(x_train[:,i]))/np.std(x_train[:,i])
    x_test[:,i] = (x_test[:,i]-np.mean(x_test[:,i]))/np.std(x_test[:,i])

This will give proper cost values, with an accuracy of 95.8041958041958 %.
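As a side note, the same per-column standardization can be written without the loop, using broadcasting over `axis=0`. Reusing the *training* mean/std for the test set (an assumption on my part about what you want, but common practice) also avoids leaking test-set statistics. The arrays below are stand-in data, not the actual dataset:

```python
import numpy as np

# Stand-in data with different scales per column, not the real dataset.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(6, 3)) * 10 + 5
x_test = rng.normal(size=(4, 3)) * 10 + 5

# axis=0 statistics broadcast across rows, standardizing every column at once.
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)
x_train = (x_train - mu) / sigma
x_test = (x_test - mu) / sigma       # note: training statistics, not test

print(np.allclose(x_train.mean(axis=0), 0))  # True
print(np.allclose(x_train.std(axis=0), 1))   # True
```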