Specify shape for categorical feature columns?
I know I can use categorical_column_with_identity to turn a categorical feature into a series of one-hot features.
For example, if my vocabulary is ["ON", "OFF", "UNKNOWN"]:
"OFF" -> [0, 1, 0]
categorical_column = tf.feature_column.categorical_column_with_identity('column_name', num_buckets=3)
feature_column = tf.feature_column.indicator_column(categorical_column)
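However, I actually have a one-dimensional array of categorical features, and I would like to turn it into a two-dimensional array of one-hot features: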
["OFF", "ON", "OFF", "UNKNOWN", "ON"]
->
[[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
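Unlike every other feature column, categorical_column_with_identity does not seem to have a shape attribute, and I have not found any help through Google or the docs.
Do I have to give up on categorical_column_with_identity and build the 2D array myself via numerical_column?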
Per the comments, I am not sure this is possible with tensorflow. With Pandas, though, there is a simple solution, pd.get_dummies:
import pandas as pd
L = ['OFF', 'ON', 'OFF', 'UNKNOWN', 'ON']
res = pd.get_dummies(L)
print(res)
   OFF  ON  UNKNOWN
0    1   0        0
1    0   1        0
2    1   0        0
3    0   0        1
4    0   1        0
For performance, or if you only need a NumPy array, you can use LabelBinarizer from sklearn.preprocessing:
from sklearn.preprocessing import LabelBinarizer
LB = LabelBinarizer()
res = LB.fit_transform(L)
print(res)
array([[1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0]])
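Note that both outputs above order the columns by sorted label (OFF, ON, UNKNOWN), whereas the vocabulary in the question is ["ON", "OFF", "UNKNOWN"]. If the column order has to follow a fixed vocabulary, one option is to index into an identity matrix. This is only a minimal sketch, assuming NumPy is installed and every label appears in the vocabulary; the variable names are illustrative.

```python
import numpy as np

labels = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
vocabulary = ["ON", "OFF", "UNKNOWN"]  # column order we want to enforce

# Map each label to its position in the vocabulary.
index_of = {label: i for i, label in enumerate(vocabulary)}
ids = np.array([index_of[label] for label in labels])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with ids picks one row per input label.
one_hot = np.eye(len(vocabulary), dtype=int)[ids]
print(one_hot)
```

A label that is missing from the vocabulary raises a KeyError here, which may or may not be what you want.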
A couple of options for binary (one-hot) encoding:
import tensorflow as tf
test = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
encoding = {x:idx for idx, x in enumerate(sorted(set(test)))}
test = [encoding[x] for x in test]
print(tf.keras.utils.to_categorical(test, num_classes=len(encoding)))
>>>[[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
Or, as the other answer mentions, with scikit-learn:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
transformed_label = encoder.fit_transform(["OFF", "ON", "OFF", "UNKNOWN", "ON"])
print(transformed_label)
>>>[[1 0 0]
[0 1 0]
[1 0 0]
[0 0 1]
[0 1 0]]
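Going the other way, from one-hot rows back to labels, amounts to finding the position of the 1 in each row and looking it up in the class list. The sketch below does this in plain Python, assuming the columns follow the sorted classes (OFF, ON, UNKNOWN) as in the outputs above; the helper name decode_one_hot is hypothetical. With scikit-learn itself, encoder.inverse_transform should give the same result.

```python
# Assumed column order: the alphabetically sorted class labels.
classes = ["OFF", "ON", "UNKNOWN"]

def decode_one_hot(rows, classes):
    """Map each one-hot row back to its label via the index of the 1."""
    return [classes[row.index(1)] for row in rows]

encoded = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
decoded = decode_one_hot(encoded, classes)
print(decoded)
```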
You can use a dictionary as a lookup map, like this:
categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_features = []
one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}  # named to avoid shadowing the built-in map
for val in categorical_features:
    one_hot_features.append(one_hot_map[val])
Or use a list comprehension:
categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}
one_hot_features = [one_hot_map[f] for f in categorical_features]
This should give you what you need.
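If the vocabulary grows, writing the dictionary by hand gets error-prone. A possible variation is to generate it from the vocabulary list. The helper name build_one_hot_map is hypothetical, and sending unseen labels to the "UNKNOWN" vector is an assumption made for this sketch, not something the question specifies.

```python
def build_one_hot_map(vocabulary):
    """Return {label: one-hot list}, with columns in vocabulary order."""
    size = len(vocabulary)
    return {
        label: [1 if i == position else 0 for i in range(size)]
        for position, label in enumerate(vocabulary)
    }

vocabulary = ["ON", "OFF", "UNKNOWN"]
one_hot_map = build_one_hot_map(vocabulary)

# Assumption: labels outside the vocabulary are treated as "UNKNOWN".
fallback = one_hot_map["UNKNOWN"]
categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON", "STANDBY"]
one_hot_features = [one_hot_map.get(f, fallback) for f in categorical_features]
print(one_hot_features)
```

Dropping the fallback and indexing with one_hot_map[f] instead makes an unexpected label fail loudly with a KeyError.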