How to use feature hasher to convert non-numerical discrete data so that it can be passed to SVM?

Question

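I am trying to use the CRX dataset from the UCI Machine Learning Repository. This particular dataset contains some features that are not continuous variables, so I need to convert them into numerical values before they can be passed to an SVM.

I initially looked into using the one-hot encoder, which takes integer values and converts them into matrices (e.g. if a feature has three possible values, 'red', 'blue' and 'green', it is transformed into three binary features: 1,0,0 for 'red', 0,1,0 for 'blue' and 0,0,1 for 'green'). This would be ideal for my needs, except that it can only deal with integer features.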

def get_crx_data(debug=False):

    with open("/Volumes/LocalDataHD/jt306/crx.data", "rU") as infile:
        features_array = []
        reader = csv.reader(infile,dialect=csv.excel_tab)
        for row in reader:
            features_array.append(str(row).translate(None,"[]'").split(","))
        features_array = np.array(features_array)
        print features_array.shape
        print features_array[0]
        labels_array = features_array[:,15]
        features_array = features_array[:,:15]
        print features_array.shape
        print labels_array.shape


        print("FeatureHasher on frequency dicts")

        hasher = FeatureHasher(n_features=44)
        X = hasher.fit_transform(line for line in features_array)

        print X.shape



get_crx_data()

This returns:

Reading CRX data from disk
Traceback (most recent call last):
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 38, in <module>

get_crx_data()
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 32, in get_crx_data

X = hasher.fit_transform(line for line in features_array)

File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 426, in fit_transform
    return self.fit(X, **fit_params).transform(X)

File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)

File "_hashing.pyx", line 44, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1649)

File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 125, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)

File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()

AttributeError: 'numpy.ndarray' object has no attribute 'items'

(690, 16)
['0' ' 30.83' ' 0' ' u' ' g' ' w' ' v' ' 1.25' ' 1' ' 1' ' 1' ' 0' ' g'
 ' 202' ' 0' ' +']
(690, 15)
(690,)
FeatureHasher on frequency dicts

Process finished with exit code 1


How can I use feature hashing (or an alternative method) to convert this data from classes (some of which are strings, others are discrete numerical values) into data which can be handled by an SVM? I have also looked into using one-hot coding, but that only takes integers as input.

Answer 1

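The problem is that the FeatureHasher object expects each row of input to have a particular structure, or really, one of three different possible structures. The first possibility is a dictionary of feature_name:value pairs. The second is a list of (feature_name, value) tuples. The third is a flat list of feature_names. In the first two cases, the feature names are mapped to columns in the matrix, and the given values are stored at those columns for each row. In the last, the presence or absence of a feature in the list is implicitly understood as a True or False value. Here are some simple, concrete examples: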

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='dict')
>>> X_new = hasher.fit_transform([{'a':1, 'b':2}, {'a':0, 'c':5}])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])

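This illustrates the default mode: what FeatureHasher expects if you don't pass input_type, as in your original code. As you can see, the expected input is a list of dictionaries, one for each input sample or row of data. Each dictionary contains an arbitrary number of feature names, mapped to values for that row. Your code passes rows of a numpy array instead, and those have no items() method, which is why you get the AttributeError.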

The output, X_new, contains a sparse representation of the array; calling toarray() returns a new copy of the data as a plain numpy array.
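To make the sparse versus dense point concrete, here is a minimal sketch. It assumes a reasonably recent scikit-learn, where (as far as I know) the alternate_sign=False argument takes the place of the older non_negative flag used in the examples here. Which columns the features land in depends on the hash, so no column positions are claimed.

```python
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher

# Assumption: a recent scikit-learn, where alternate_sign=False keeps the
# hashed values non-negative (the role non_negative=True played earlier).
hasher = FeatureHasher(n_features=10, input_type="dict", alternate_sign=False)

# FeatureHasher is stateless, so transform() can be called without fitting.
X_new = hasher.transform([{"a": 1, "b": 2}, {"a": 0, "c": 5}])

print(sparse.issparse(X_new))    # the hasher returns a SciPy sparse matrix
dense = X_new.toarray()          # dense copy as a regular numpy.ndarray
print(type(dense), dense.shape)
```

Most scikit-learn SVM estimators accept the sparse matrix directly, so toarray() is mainly useful for inspecting small examples.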

If you want to pass pairs of tuples instead, pass input_type='pair'. Then you can do this:

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='pair')
>>> X_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])

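Finally, if you have only boolean values, you don't have to pass values explicitly at all: FeatureHasher will simply assume that if a feature name is present, its value is True (represented here as the floating-point value 1.0).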

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='string')
>>> X_new = hasher.fit_transform([['a', 'b'], ['a', 'c']])
>>> X_new.toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])
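For purely categorical columns, one way to use the 'string' mode is to turn every cell into a "column=value" token, so that the same value appearing in two different columns hashes to different features. The sketch below is only an illustration: the column names c0, c1, ... and the helper row_to_tokens are hypothetical, the rows are made up, and alternate_sign=False assumes a recent scikit-learn.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up categorical rows, just for illustration.
rows = [
    ["u", "g", "w"],
    ["y", "p", "w"],
]

def row_to_tokens(row):
    """Prefix each value with a (hypothetical) column name: 'c<index>=<value>'."""
    return ["c{}={}".format(i, v.strip()) for i, v in enumerate(row)]

# With input_type='string', every token present in a row gets the value 1.
hasher = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)
X_tokens = hasher.transform([row_to_tokens(r) for r in rows])

print(X_tokens.toarray())
```

Each row ends up with a total weight equal to its number of tokens; if two tokens collide in the same column, their values are added together.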

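Unfortunately, your data doesn't seem to be consistently in any of these formats. However, it shouldn't be too hard to modify what you have to fit the 'dict' or 'pair' format. Let me know if you need help with that; in that case, please say more about the format of the data you're trying to convert.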
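As a rough sketch of what that conversion to the 'dict' format could look like, the snippet below maps each row of strings to a dictionary: continuous columns keep their numeric value, and every other column becomes a "column=value" feature with value 1. Everything specific here is an assumption for illustration: the helper row_to_dict, the column names, the choice of which column indices are numeric, and the toy rows (shaped like the sample row in the question, with made-up values and labels).

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.svm import SVC

# Assumption: indices of the columns that hold continuous values in this toy layout.
NUMERIC_COLUMNS = {1, 2}

def row_to_dict(row, numeric_columns=NUMERIC_COLUMNS):
    """Map one row of strings to a {feature_name: value} dict."""
    features = {}
    for index, raw in enumerate(row):
        value = raw.strip()
        if index in numeric_columns:
            # Continuous column: keep the number under a per-column name.
            features["c{}".format(index)] = float(value)
        else:
            # Discrete column: one boolean feature per (column, value) pair.
            features["c{}={}".format(index, value)] = 1
    return features

# Toy rows and labels, made up for illustration only.
rows = [
    ["0", " 30.83", " 0", " u", " g"],
    ["1", " 58.67", " 4.46", " u", " g"],
    ["1", " 24.50", " 0.5", " y", " p"],
    ["0", " 27.83", " 1.54", " u", " g"],
]
labels = ["+", "+", "-", "-"]

hasher = FeatureHasher(n_features=32, input_type="dict")
X = hasher.transform([row_to_dict(r) for r in rows])

# The sparse output can be passed to the SVM directly.
clf = SVC(kernel="linear").fit(X, labels)
print(X.shape)
print(clf.predict(X))
```

SVMs are sensitive to feature scale, so the continuous columns would usually benefit from scaling before training on the real data.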

python

numpy

machine-learning

feature-extraction

scikit-learn