Fitting pandas DataFrames to scikit-learn models without using additional libraries or methods
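On the one hand, people say pandas goes along great with scikit-learn. For example, in this video pandas Series objects fit well with sklearn models. On the other hand, sklearn-pandas provides a bridge between scikit-learn's machine learning methods and pandas-style data frames, which implies that there is a need for such a library. Moreover, some people convert pandas data frames to NumPy arrays before fitting a model.
I wonder whether it is possible to combine pandas and scikit-learn without any additional methods or libraries. My problem is that whenever I fit my dataset to an sklearn model in the following way: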
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
d = {'x': np.linspace(1., 100., 20), 'y': np.linspace(1., 10., 20)}
df = pd.DataFrame(d)
train, test = train_test_split(df, test_size = 0.2)
trainX = train['x']
trainY = train['y']
lin_svm = SVC(kernel='linear').fit(trainX, trainY)
I get an error:
ValueError: Unknown label type: 19 10.000000
0 1.000000
17 9.052632
18 9.526316
12 6.684211
11 6.210526
16 8.578947
14 7.631579
10 5.736842
7 4.315789
8 4.789474
2 1.947368
13 7.157895
1 1.473684
6 3.842105
3 2.421053
Name: y, dtype: float64
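As far as I understand, this is because of the data structure. However, there are few examples on the internet that use similar code without any problems.
What you probably want to do is regression rather than classification.
Think about it: to do classification you need either a binary output or a multiclass one. In your case, you are feeding continuous data to a classifier.
If you trace back your error and dig into sklearn's implementation of the .fit() method, you will find the following function: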
def check_classification_targets(y):
    """Ensure that target y is of a non-regression type.

    Only the following target types (as defined in type_of_target) are allowed:
        'binary', 'multiclass', 'multiclass-multioutput',
        'multilabel-indicator', 'multilabel-sequences'

    Parameters
    ----------
    y : array-like
    """
    y_type = type_of_target(y)
    if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
                      'multilabel-indicator', 'multilabel-sequences']:
        raise ValueError("Unknown label type: %r" % y)
The docstring of the function type_of_target is:
def type_of_target(y):
    """Determine the type of data indicated by target `y`

    Parameters
    ----------
    y : array-like

    Returns
    -------
    target_type : string
        One of:
        * 'continuous': `y` is an array-like of floats that are not all
          integers, and is 1d or a column vector.
        * 'continuous-multioutput': `y` is a 2d array of floats that are
          not all integers, and both dimensions are of size > 1.
        * 'binary': `y` contains <= 2 discrete values and is 1d or a column
          vector.
        * 'multiclass': `y` contains more than two discrete values, is not a
          sequence of sequences, and is 1d or a column vector.
        * 'multiclass-multioutput': `y` is a 2d array that contains more
          than two discrete values, is not a sequence of sequences, and both
          dimensions are of size > 1.
        * 'multilabel-indicator': `y` is a label indicator matrix, an array
          of two dimensions with at least two columns, and at most 2 unique
          values.
        * 'unknown': `y` is array-like but none of the above, such as a 3d
          array, sequence of sequences, or an array of non-sequence objects.
In your case, type_of_target(trainY) == 'continuous', so a ValueError is raised in the function check_classification_targets().
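You can check this yourself by calling type_of_target directly. The following is a minimal sketch, assuming a scikit-learn version in which type_of_target is importable from sklearn.utils.multiclass. The series y_binary and y_multiclass are made-up targets derived from the question's y column purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Same target values as the 'y' column in the question: non-integer floats.
y_continuous = pd.Series(np.linspace(1., 10., 20), name='y')

# Hypothetical discrete targets derived from it, for illustration only.
y_binary = (y_continuous > 5).astype(int)                   # two classes: 0 and 1
y_multiclass = pd.cut(y_continuous, bins=3, labels=False)   # three integer bin codes

kind_continuous = type_of_target(y_continuous)
kind_binary = type_of_target(y_binary)
kind_multiclass = type_of_target(y_multiclass)

print(kind_continuous, kind_binary, kind_multiclass)
```

Only the first target is reported as continuous, which is the type a classifier such as SVC refuses.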
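Conclusion:
- If you want to perform classification, change your target y (for example, use a binary vector).
- If you want to keep your continuous data, perform regression instead. Use svm.SVR.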
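Both options work with plain pandas objects, with no bridging library. Here is a minimal sketch of each, under a few assumptions of mine that are not in the question: a recent scikit-learn where train_test_split lives in sklearn.model_selection, a fixed random_state for reproducibility, and an arbitrary threshold of 5 to build a binary label. Note that the features are selected with double brackets (train[['x']]) so that X is a 2D DataFrame, because recent scikit-learn versions expect a 2D feature matrix, not a 1D Series.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR

d = {'x': np.linspace(1., 100., 20), 'y': np.linspace(1., 10., 20)}
df = pd.DataFrame(d)
train, test = train_test_split(df, test_size=0.2, random_state=0)

# Double brackets keep X two-dimensional (a one-column DataFrame).
X_train = train[['x']]
X_test = test[['x']]

# Option 1: keep the continuous target and do regression.
reg = SVR(kernel='linear').fit(X_train, train['y'])
y_pred = reg.predict(X_test)

# Option 2: turn the target into discrete labels and do classification.
# The threshold of 5 is arbitrary and only serves the illustration.
labels_train = (train['y'] > 5).astype(int)
clf = SVC(kernel='linear').fit(X_train, labels_train)
label_pred = clf.predict(X_test)

print(y_pred)
print(label_pred)
```

In both cases the DataFrame and Series are passed straight to fit() and predict(); the predictions come back as NumPy arrays.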