How to normalize only certain columns in scikit-learn?
I have data like the following:
[
[0, 4, 15]
[0, 3, 7]
[1, 5, 9]
[2, 4, 15]
]
I preprocessed this data with OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.fit_transform) to make it suitable for linear regression, which gave me this:
[
[1, 0, 0, 4, 15]
[1, 0, 0, 3, 7]
[0, 1, 0, 5, 9]
[0, 0, 1, 4, 15]
]
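However, I would like to normalize this data.
So far, I have just been normalizing the data like this:
preprocessing.normalize(data)
However, this normalizes all columns, including the category columns.
My questions are as follows:
- How can I normalize only certain columns?
- Is normalizing category data advisable, or should it be avoided?
Thanks!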
Use numpy to pass a slice of your data to normalize. As for your question about normalizing category data, you will probably get a better answer to that question on CrossValidated.
Example for the first question:
In [1]: import numpy as np
from sklearn.preprocessing import normalize
# Values as floats or normalize raises a type error
X1 = np.array([
[1., 0., 0., 4., 15.],
[1., 0., 0., 3., 7.],
[0., 1., 0., 5., 9.],
[0., 0., 1., 4., 15.],
])
In [2]: X1[:, [3,4]] # last two columns
Out[2]: array([[ 4., 15.],
[ 3., 7.],
[ 5., 9.],
[ 4., 15.]])
Normalize the last two columns and assign the result to a new numpy array, X2.
In [3]: X2 = normalize(X1[:, [3,4]], axis=0) #axis=0 for column-wise
X2
Out[3]: array([[ 0.49236596, 0.6228411 ],
[ 0.36927447, 0.29065918],
[ 0.61545745, 0.37370466],
[ 0.49236596, 0.6228411 ]])
Now concatenate the first three columns of X1 with X2 to get your desired output.
In [4]: np.concatenate(( X1[:,[0,1,2]], X2), axis=1)
Out[4]: array([[ 1. , 0. , 0. , 0.49236596, 0.6228411 ],
[ 1. , 0. , 0. , 0.36927447, 0.29065918],
[ 0. , 1. , 0. , 0.61545745, 0.37370466],
[ 0. , 0. , 1. , 0.49236596, 0.6228411 ]])
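The slice, normalize, and concatenate steps above can also be folded into one small helper that writes the normalized values back into a copy of the array. This is only a sketch: `normalize_columns` is a hypothetical name, not part of numpy or scikit-learn, and it assumes the input is a 2-D numeric array.

```python
import numpy as np
from sklearn.preprocessing import normalize


def normalize_columns(X, cols):
    """Return a float copy of X in which only `cols` are scaled to unit L2 norm.

    `normalize_columns` is a hypothetical helper written for this example.
    """
    out = np.array(X, dtype=float)  # copy, so the caller's array is untouched
    # axis=0 normalizes each selected column independently
    out[:, cols] = normalize(out[:, cols], axis=0)
    return out


X1 = np.array([
    [1., 0., 0., 4., 15.],
    [1., 0., 0., 3., 7.],
    [0., 1., 0., 5., 9.],
    [0., 0., 1., 4., 15.],
])

X_out = normalize_columns(X1, [3, 4])
print(X_out)
```

The one-hot columns pass through unchanged, and the column order of the input is preserved without a separate concatenate call.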
If you are working with a pandas.DataFrame, you may want to check out sklearn-pandas.
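For a DataFrame, the same idea also works with plain pandas column selection, without any extra package. The following is a minimal sketch; the column names (`cat_0`, `cat_1`, `cat_2`, `x`, `y`) are made up for illustration and do not come from the question.

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical column names; the values mirror the one-hot encoded data above.
df = pd.DataFrame({
    "cat_0": [1.0, 1.0, 0.0, 0.0],
    "cat_1": [0.0, 0.0, 1.0, 0.0],
    "cat_2": [0.0, 0.0, 0.0, 1.0],
    "x": [4.0, 3.0, 5.0, 4.0],
    "y": [15.0, 7.0, 9.0, 15.0],
})

numeric_cols = ["x", "y"]
# Normalize only the numeric columns (column-wise) and assign them back.
df[numeric_cols] = normalize(df[numeric_cols].to_numpy(), axis=0)
print(df)
```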
You can use the sklearn.preprocessing.MinMaxScaler class to scale each column so that its minimum is 0 and its maximum is 1. The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler
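To apply MinMaxScaler to only some columns, one option is scikit-learn's `ColumnTransformer`, which this answer does not mention and which is swapped in here as an assumption about what fits the question. The sketch below starts from the raw data in the question, one-hot encodes column 0, and scales columns 1 and 2.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Raw data from the question: column 0 is categorical, columns 1 and 2 are numeric.
data = np.array([
    [0, 4, 15],
    [0, 3, 7],
    [1, 5, 9],
    [2, 4, 15],
])

preprocess = ColumnTransformer(
    transformers=[
        ("category", OneHotEncoder(), [0]),    # one-hot encode column 0
        ("numeric", MinMaxScaler(), [1, 2]),   # scale columns 1 and 2 to [0, 1]
    ],
    sparse_threshold=0,  # always return a dense array
)

result = preprocess.fit_transform(data)
print(result)
```

Output columns follow the order of the `transformers` list, so the three one-hot columns come first and the two scaled columns last. With real data, the transformer would normally be fit on the training set and then reused via `transform` on new data.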