Different results from scikit-learn and dask-ml LogisticRegression
When running the same LogisticRegression on the same data, there should be no difference in the results between the scikit-learn and dask-ml implementations.
Versions:
scikit-learn=0.21.2
dask-ml=1.0.0
First, with the dask-ml LogisticRegression:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics
from dask_yarn import YarnCluster
from dask.distributed import Client
from dask_ml.linear_model import LogisticRegression
import dask.dataframe as dd
import dask.array as da
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
lr = LogisticRegression(solver_kwargs={"normalize":False})
lr.fit(x_train, y_train)
score = lr.score(x_test, y_test)
print(score)
predictions = lr.predict(x_test)
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
Now with the sklearn LogisticRegression:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics
from dask_yarn import YarnCluster
from dask.distributed import Client
from sklearn.linear_model import LogisticRegression
import dask.dataframe as dd
import dask.array as da
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
lr = LogisticRegression()
lr.fit(x_train, y_train)
score = lr.score(x_test, y_test)
print(score)
predictions = lr.predict(x_test)
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
Score and confusion matrix for scikit-learn:
0.9533333333333334
[[37 0 0 0 0 0 0 0 0 0]
[ 0 39 0 0 0 0 2 0 2 0]
[ 0 0 41 3 0 0 0 0 0 0]
[ 0 0 1 43 0 0 0 0 0 1]
[ 0 0 0 0 38 0 0 0 0 0]
[ 0 1 0 0 0 47 0 0 0 0]
[ 0 0 0 0 0 0 52 0 0 0]
[ 0 1 0 1 1 0 0 45 0 0]
[ 0 3 1 0 0 0 0 0 43 1]
[ 0 0 0 1 0 1 0 0 1 44]]
Score and confusion matrix for dask-ml:
0.09555555555555556
[[ 0 37 0 0 0 0 0 0 0 0]
[ 0 43 0 0 0 0 0 0 0 0]
[ 0 44 0 0 0 0 0 0 0 0]
[ 0 45 0 0 0 0 0 0 0 0]
[ 0 38 0 0 0 0 0 0 0 0]
[ 0 48 0 0 0 0 0 0 0 0]
[ 0 52 0 0 0 0 0 0 0 0]
[ 0 48 0 0 0 0 0 0 0 0]
[ 0 48 0 0 0 0 0 0 0 0]
[ 0 47 0 0 0 0 0 0 0 0]]
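As of version dask_ml==1.0.0, dask-ml does not support logistic regression with more than two classes. Using a slightly modified version of the original example, if you print predictions from the fitted dask-ml LogisticRegression classifier, you will see that it returns a boolean array filled with True.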
from sklearn.datasets import load_digits
from dask_ml.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
lr = LogisticRegression(solver_kwargs={"normalize": False})
lr.fit(X, y)
predictions = lr.predict(X)
print('predictions = {}'.format(predictions))
Output:
predictions = [ True True True ... True True True]
This is why the dask-ml and scikit-learn confusion matrices differ from each other.
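The reported dask-ml score can be reproduced by hand from that observation. The sketch below is plain Python and does not call dask-ml or scikit-learn. It assumes that a prediction of True is counted as class 1 (in Python, True == 1), and it takes the per-class test counts from the rows of the dask-ml confusion matrix in the question.

```python
# Per-class test counts, read off the rows of the dask-ml confusion matrix
# in the question (row i = number of test samples whose true label is i).
class_counts = [37, 43, 44, 45, 38, 48, 52, 48, 48, 47]

# Rebuild a y_test with the same class distribution (order does not matter here).
y_test = [label for label, n in enumerate(class_counts) for _ in range(n)]

# The fitted dask-ml classifier returned a boolean array filled with True.
predictions = [True] * len(y_test)

# Assumption: True is treated as label 1, so only true 1s count as correct.
correct = sum(int(p) == t for p, t in zip(predictions, y_test))
accuracy = correct / len(y_test)

# Minimal confusion matrix: rows = true label, columns = predicted label.
n_classes = len(class_counts)
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_test, predictions):
    cm[t][int(p)] += 1

print(correct, len(y_test), accuracy)
print([row[1] for row in cm])  # every sample lands in column 1
```

Only the 43 test samples whose true label is 1 are counted as correct out of 450, which matches the reported score of about 0.0956 and explains why column 1 is the only non-zero column.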
There is a related open issue about this.
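Until multiclass targets are supported, one way to avoid silently wrong scores is to check the number of classes before fitting. The helper below is a hypothetical sketch, not part of dask-ml or scikit-learn: the name require_binary_target and its error message are made up for illustration.

```python
def require_binary_target(y):
    """Hypothetical guard: raise if y holds more than two distinct labels.

    Accepts a list or anything with a .tolist() method (e.g. a NumPy array).
    Returns the sorted list of distinct labels when the check passes.
    """
    values = y.tolist() if hasattr(y, "tolist") else list(y)
    classes = sorted(set(values))
    if len(classes) > 2:
        raise ValueError(
            "found {} classes; this estimator is assumed to be "
            "binary-only".format(len(classes))
        )
    return classes


# A binary target passes the check.
print(require_binary_target([0, 1, 1, 0]))

# A ten-class target such as the digits labels is rejected.
try:
    require_binary_target(list(range(10)))
except ValueError as exc:
    print("rejected:", exc)
```

Calling such a check on y_train before lr.fit(x_train, y_train) would have flagged the ten-class digits target in the question instead of producing an all-True prediction array.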