Not getting correct contour plot of coefficients from my Logistic Regression implementation?

I implemented logistic regression and ran it on a data set. (This is the exercise from week 3 of Coursera's ML course, which normally uses Matlab and Octave; I'm doing it in Python, so this isn't cheating.)

I started from the implementation in sklearn to classify the data set used in week 3 of the course (http://pastie.org/10872959). Here is a small, reproducible example that anyone can try (it depends only on numpy and sklearn):

It takes the data set, splits it into a feature matrix and an output matrix, and then constructs 26 more features from the original two (i.e. from the columns x1 and x2; the full feature map is spelled out below). I then use logistic regression from sklearn, but this does not give the desired contour plot (please see below).
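Concretely, constructVariations builds every monomial in the two inputs up to total degree 6:

x1**(i - j) * x2**j,   for 1 <= i <= 6 and 0 <= j <= i

which gives 2 + 3 + 4 + 5 + 6 + 7 = 27 columns; a column of ones is prepended afterwards, for 28 features in total.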

from sklearn.linear_model import LogisticRegression as expit  # note: 'expit' is just an alias for LogisticRegression
import numpy as np

def thetaFunc(y, theta, x):
    # evaluates theta . phi(x, y) over all 27 monomials of degree 1..6
    deg = 6

    spot = 0
    sum = 0
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return sum


def constructVariations(X, deg):

    features = np.zeros((len(X), 27)) 
    spot = 0

    for i in range(1, deg + 1):
        for j in range(i + 1):

            features[:, spot] = X[:,0]**(i - j) * X[:,1]**(j)
            spot += 1

    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter = ",")
    X,Y = np.split(data, [len(data[0,:]) - 1], 1)
    X = constructVariations(X, 6)

    oneArray = np.ones((len(X),1))
    X = np.hstack((oneArray, X))
    trial = expit(solver = 'sag')
    trial = trial.fit(X = X,y = np.ravel(Y))
    print(trial.coef_)

    # everything below has been edited in

    from matplotlib import pyplot as plt

    txt = open("RegLogTheta", "r").read()  # file holding the printed trial.coef_ values
    txt = txt.split()
    theta = np.array(txt, float)

    x = np.linspace(-1, 1.5, 100)
    y = np.linspace(-1,1.5,100)
    z = np.empty((100,100))


    xx,yy = np.meshgrid(x,y)
    for i in range(len(x)):
         for j in range(len(y)):
             z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])

    plt.contour(xx,yy,z, levels = [0])
    plt.show()

Here are the coefficients it gives for the generic feature terms: http://pastie.org/10872957 (i.e. the coefficients to the terms x1**(i - j) * x2**j), and the contour it generates:


One potential source of error is that I'm misinterpreting the 7 x 4 coefficient matrix stored in trial._coeff. I believe these 28 values are the coefficients of the 28 "variations" above, and I have mapped coefficients to variations both column-wise and row-wise. By column-wise I mean that [:][0] maps to the first 7 variations, [:][1] to the next 7, and so on; my function constructVariations shows how the variations are systematically created. Now, the API maintains that an array of shape (n_classes, n_features) is stored in trial._coeff, so should I infer that fit classified the data into four classes? Or have I run through this problem poorly in another way?


Update

Something is definitely wrong with my interpretation (and/or use) of the weights:

Instead of relying on sklearn's built-in predict, I tried to compute the x and y values that set the following to 1/2:

1 / (1 + exp(-thetaFunc(y, theta, x)))

The values of theta are those found from printing trial._coeff, and x and y are scalars. Those x, y pairs are then plotted to give the contour.

The code I used (but did not originally include) attempts to do exactly that. What is wrong with the math behind it?

One potential source of error is that I'm misinterpreting the 7 X 4 matrix coefficient matrix stored in trial._coeff

This matrix is not 7 x 4, it is 1 x 28 (check print(trial.coef_.shape)). There is one coefficient for each of the 28 features (constructVariations returns 27, and you add the column of 1s by hand).
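A quick way to confirm this (a sketch, assuming trial is the fitted model from the code above):

print(trial.coef_.shape)       # (1, 28): a single row, one weight per feature
print(trial.intercept_.shape)  # (1,): sklearn also fits its own intercept by default
print(trial.classes_)          # the two labels of the binary problem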

so should I infer that fit classified the data into four classes?

No, you misread the array: it has just one row (two rows would make no sense for binary classification).

Or have I run through this problem poorly in another way?

The code is fine; the interpretation is not. In particular, look at the actual decision boundary of your model (drawn by calling predict and plotting the resulting contour):

from sklearn.linear_model import LogisticRegression as expit
import numpy as np

def constructVariations(X, deg):

    features = np.zeros((len(X), 27)) 
    spot = 0

    for i in range(1, deg + 1):
        for j in range(i + 1):

            features[:, spot] = X[:,0]**(i - j) * X[:,1]**(j)
            spot += 1

    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter = ",")
    X,Y = np.split(data, [len(data[0,:]) - 1], 1)
    rawX = np.copy(X)    
    X = constructVariations(X, 6)

    oneArray = np.ones((len(X),1))
    X = np.hstack((oneArray, X))
    trial = expit(solver = 'sag')
    trial = trial.fit(X = X,y = np.ravel(Y))
    print(trial.coef_)

    from matplotlib import pyplot as plt

    h = 0.01
    x_min, x_max = rawX[:, 0].min() - 1, rawX[:, 0].max() + 1
    y_min, y_max = rawX[:, 1].min() - 1, rawX[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    data = constructVariations(np.c_[xx.ravel(), yy.ravel()], 6)
    oneArray = np.ones((len(data),1))
    data = np.hstack((oneArray, data))
    Z = trial.predict(data)
    Z = Z.reshape(xx.shape)

    plt.figure()
    plt.scatter(rawX[:, 0], rawX[:, 1], c=np.ravel(Y), linewidth=0, s=50)  # ravel: matplotlib expects a 1-D color array
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.show()
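The same grid construction also yields the exact boundary curve rather than filled regions (a minimal sketch, reusing data, xx, yy and trial from the script above): decision_function returns theta . phi + intercept, and the boundary is its zero level set:

Z = trial.decision_function(data).reshape(xx.shape)  # theta . phi + intercept on the grid
plt.contour(xx, yy, Z, levels=[0])                   # the boundary is the zero level set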

Update

In the code provided you forgot (in the visualization) that you added a column of "1"s to your data representation, so your thetas are one "off": theta[0] is the bias, theta[1] relates to your 0'th variable, and so on.

def thetaFunc(y, theta, x):

    deg = 6

    spot = 0
    sum = theta[spot]

    spot += 1
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return sum

You also forgot the intercept term of the logistic regression itself, thus

xx,yy = np.meshgrid(x,y)
for i in range(len(x)):
     for j in range(len(y)):
         z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])
z -= trial.intercept_

(Image generated with your fixed code)

import numpy as np
from sklearn.linear_model import LogisticRegression as expit

def thetaFunc(y, theta, x):
    # theta . phi(x, y), with the bias theta[0] included up front
    deg = 6

    spot = 0
    sum = theta[spot]

    spot += 1
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return np.exp(-sum)  # e^(-(theta . phi))


def constructVariations(X, deg):

    features = np.zeros((len(X), 27)) 
    spot = 0

    for i in range(1, deg + 1):
        for j in range(i + 1):

            features[:, spot] = X[:,0]**(i - j) * X[:,1]**(j)
            spot += 1

    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter = ",")
    X,Y = np.split(data, [len(data[0,:]) - 1], 1)

    X = constructVariations(X, 6)
    rawX = np.copy(X)  # columns 0 and 1 are x1 and x2 themselves, so the scatter below still works

    oneArray = np.ones((len(X),1))
    X = np.hstack((oneArray, X))
    trial = expit(solver = 'sag')
    trial = trial.fit(X = X,y = np.ravel(Y))

    from matplotlib import pyplot as plt

    theta = trial.coef_.ravel()

    x = np.linspace(-1, 1.5, 100)
    y = np.linspace(-1,1.5,100)
    z = np.empty((100,100))


    xx,yy = np.meshgrid(x,y)
    for i in range(len(x)):
         for j in range(len(y)):
             z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])
    z -= trial.intercept_

    plt.contour(xx,yy,z > 1,cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(rawX[:, 0], rawX[:, 1], c=np.ravel(Y), linewidth=0, s=50)
    plt.show()
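A note on why contouring z > 1 draws (approximately) the right boundary: thetaFunc returns e^(-(theta . phi(x, y))), so after z -= trial.intercept_ the plotted condition z > 1 reads e^(-(theta . phi)) > 1 + b, i.e. theta . phi < -ln(1 + b), where b is the intercept. Since ln(1 + b) is approximately b for small b, this is, to first order, the region where theta . phi + b < 0, i.e. where the predicted probability is below 1/2; the contour separating the two regions is therefore the decision boundary, the same zero level set that decision_function exposes directly.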