如何使用来自 Keras ANN 的学习嵌入层作为 XGBoost 模型中的输入特征?

How to use a learned embedding layer from a Keras ANN as an input feature in an XGBoost model?

我试图通过从神经网络中提取嵌入层并将其用作单独的 XGBoost 模型中的输入特征来降低分类特征的维数。

嵌入层的维度为(nr.unique categories + 1, chosen output size)。如何将其连接到原始训练数据中维度为(nr.observations,nr.features)的连续变量?

下面是一个使用神经网络进行回归的可重现示例,其中分类特征被编码为学习嵌入层。该示例紧密改编自: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers

最后我打印了嵌入层及其形状。该层如何与原始训练数据(X_train_continuous)中的连续特征合并?如果行数等于类别数,并且如果我们知道类别在嵌入层中的表示顺序,则嵌入数组也许可以连接到类别的训练观察,但行数等于类别数 + 1(在代码中:len(values) + 1)。

# Imports and helper functions

import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.layers import Input, Embedding, Dense
from keras.models import Model
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once in 'display' epochs
    """

    def __init__(self, display=100):
        self.display = display

    def on_train_begin(self, logs={}):
        self.epochs = 0

    def on_epoch_end(self, batch, logs={}):
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
            self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example

per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a,s,n).

    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * \
           per_room_additional_price[area]

# Create toy data

AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames

    Note that the np.random.choice call only determines the number of rooms and the size of the house
    (the price, which we calculate later, is deterministic)

    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []

    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms) for area in
                    AREAS]

    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])

# Create the train and validation sets

train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set

train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors

continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization

# Normalizing both train and test sets to have 0 mean and std. of 1 using the
# train set mean and std.
# This will give each feature an equal initial importance and speed up the
# training time

train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std

X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std

# Build a model using a categorical variable
# First let's define a helper class for the categorical variable

class EmbeddingMapping():
    """
    Helper class for handling categorical variables

    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer value
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in
                               enumerate(values)}

        # The num_values will be used as the input_dim when defining the
        # embedding layer.
        # It will also be returned for unseen values
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]

        # Else, return the same integer for unseen values
        else:
            return self.num_values

# Create an embedding column for the train/validation sets

area_mapping = EmbeddingMapping(X_train_categorical['area'])

X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers

# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide to what vector size we want to map our 'area' variable.
# I'll use 1 here because we only have three areas
embeddings_output = 2

# Let’s define the embedding layer and flatten it
area_embedings = Embedding(output_dim=embeddings_output,
                           input_dim=area_mapping.num_values,
                           input_length=1, name="embedding_layer")(area_input)
area_embedings = keras.layers.Reshape((embeddings_output,))(area_embedings)

# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1], ))

# Concatenate continuous and embeddings inputs
all_input = keras.layers.concatenate([continuous_input, area_embedings])

# To merge them together we will use Keras Functional API
# Will define a simple model with 2 hidden layers, with 25 neurons each.

# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note using the input object 'area_input' not 'area_embeddings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Lets train the model

epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note continuous and categorical columns are inserted in the same order as
# defined in all_inputs
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']],
                    y_train, epochs=epochs, batch_size=128, callbacks=[
        periodic_logger_250], verbose=0,
                    validation_data=([X_val_continuous, X_val_categorical[
                        'area_mapping']], y_val))

# Observe the embedding layer

embeddings_output = model.get_layer('embedding_layer').get_weights()[0]

print(f'Embedding layer:\n{embeddings_output}')
print(f'Embedding layer shape: {embeddings_output.shape}')

您可以做的一件事是 运行 您的 'pretrained' 模型,每个层都有一个唯一的名称并保存它

然后,创建您的新模型,使用您想要保留的相同命名层,并使用 Model.load_weights(file_path, by_name=真)

这将让您保留所有想要的图层,然后让您更改所有内容

首先,这个 post 有一个术语问题:“嵌入”是特定输入样本的表示。它是一层输出的向量。 “权重”是层内存储和训练的矩阵。

在 Keras 中,模型 class 是 Layer 的子class。您可以将任何模型用作更大模型中的层。

您可以创建仅包含嵌入层的模型,然后在构建模型的其余部分时将其用作层。训练后,您可以在该“子模型”上调用 .predict() 。此外,您可以将该子模型保存到 json 文件中,稍后重新加载。

这是创建发出内部嵌入的模型的标准技术。

要获得具有形状(nr.样本,选择的输出大小)的嵌入层输出:

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer("embedding_layer")
                                 .output)
embedding_output = \
    intermediate_layer_model.predict([X_train_continuous,
                                      X_train_categorical['area_mapping']])

print(embedding_output.shape)  # (3000, 1, 2)

intermediate_output = \
    embedding_output.reshape(embedding_output.shape[0], -1)

print(intermediate_output.shape)  # (3000, 2)