Why does post-padding train faster than pre-padding?
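I have been working on some NLP classification tasks and noticed that my models train much faster with post-padding than with pre-padding. I would like to understand why.
I am training these models on Google Colab with a GPU runtime. Here is my preprocessing code: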
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

PADDING = 'post'
# Tokenising the input strings and padding
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)
# Encoding output one
y1 = y1.to_numpy().reshape(-1, 1)  # Reshape to a single-feature column
encoder_1 = OneHotEncoder()  # Instantiate encoder
y1 = encoder_1.fit_transform(y1)  # Fit encoder to output and transform it
y1 = y1.toarray()  # Convert the sparse result to a dense numpy array
# Encoding output two
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
Now I create my model:
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

# --- MODEL PARAMETERS ---
vocab_size = len(tokenizer.index_word) + 1  # +1 because index 0 is reserved for padding
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])
embedding_size = 175
units = 96
# --- MODEL ARCHITECTURE ---
inputs = Input(shape=(None,))
# mask_zero=True makes downstream layers skip the padded (zero) timesteps
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)
shared_lstm = Bidirectional(LSTM(units, return_sequences=True,
                                 dropout=0.3))(input_embeddings)
y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)
y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)
split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])
It is then compiled as:
from tensorflow.keras.losses import CategoricalCrossentropy

split_shared_model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(),
    metrics=['accuracy']
)
The model summary is as follows:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_4 (InputLayer) [(None, None)] 0 []
embedding_3 (Embedding) (None, None, 175) 19075 ['input_4[0][0]']
bidirectional_8 (Bidirectional (None, None, 192) 208896 ['embedding_3[0][0]']
)
bidirectional_9 (Bidirectional (None, 192) 221952 ['bidirectional_8[0][0]']
)
bidirectional_10 (Bidirectiona (None, 192) 221952 ['bidirectional_8[0][0]']
l)
y1 (Dense) (None, 912) 176016 ['bidirectional_9[0][0]']
y2 (Dense) (None, 617) 119081 ['bidirectional_10[0][0]']
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________
After calling the fit() method, the model starts training. Below are intermediate results with the settings above:
Epoch 1/50
398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204
---------------------------------------------------------------------------
However, if I change PADDING to 'pre', training is much slower!
Epoch 1/50
90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788
Can anyone explain why this happens? I think it may have something to do with the Embedding layer and its masking, but I am not sure.
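This has to do with the underlying LSTM implementation. There are actually two: a "native Tensorflow" one and a highly optimized pure CUDA (cuDNN) implementation that is much faster. However, the latter can only be used under specific conditions (certain parameter settings and so on). You can find the details in the docs. The key point here is: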
Inputs, if use masking, are strictly right-padded.
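This means the pre-padding version is not using the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here other than sticking with post-padding.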
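To make the "strictly right-padded" condition concrete, here is a minimal sketch that checks whether a padded batch satisfies it. The helper name `is_right_padded` is hypothetical, and the code assumes that 0 is the padding value, as it is with `mask_zero=True`. It only illustrates the idea and is not the check Tensorflow runs internally.

```python
import numpy as np

def is_right_padded(batch, pad_value=0):
    """Return True if every row has all of its padding at the end."""
    batch = np.asarray(batch)
    mask = batch != pad_value               # True where there is a real token
    lengths = mask.sum(axis=1)              # number of real tokens in each row
    positions = np.arange(batch.shape[1])
    # Mask each row would have if its real tokens came first, padding last
    expected = positions[None, :] < lengths[:, None]
    return bool(np.array_equal(mask, expected))

post_padded = np.array([[5, 3, 0, 0], [1, 2, 3, 4]])  # padding='post'
pre_padded = np.array([[0, 0, 5, 3], [1, 2, 3, 4]])   # padding='pre'

print(is_right_padded(post_padded))
print(is_right_padded(pre_padded))
```

Running a check like this on `X_padded` from the question should show that only the `padding='post'` variant meets the condition.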
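Note that Tensorflow sometimes prints a warning saying that it has to fall back to the inefficient implementation. For me, however, this has been inconsistent. Keep an eye out for any additional warning output in the pre-padding case.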
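Because the warning is not reliable, it can help to check the parameter-related conditions by hand. The sketch below encodes the layer-argument requirements as I recall them from the Keras LSTM docs for TF 2.x (`activation == 'tanh'`, `recurrent_activation == 'sigmoid'`, `recurrent_dropout == 0`, `unroll` is `False`, `use_bias` is `True`). The helper `cudnn_blockers` is hypothetical, the list may differ between versions, and it does not cover the padding or eager-execution requirements.

```python
def cudnn_blockers(config):
    """List the layer settings that would rule out the fast cuDNN kernel.

    `config` is a plain dict of LSTM arguments; missing keys fall back
    to the Keras defaults. An empty result means no blocker was found
    among the settings checked here.
    """
    blockers = []
    if config.get("activation", "tanh") != "tanh":
        blockers.append("activation must be 'tanh'")
    if config.get("recurrent_activation", "sigmoid") != "sigmoid":
        blockers.append("recurrent_activation must be 'sigmoid'")
    if config.get("recurrent_dropout", 0.0) != 0:
        blockers.append("recurrent_dropout must be 0")
    if config.get("unroll", False):
        blockers.append("unroll must be False")
    if not config.get("use_bias", True):
        blockers.append("use_bias must be True")
    return blockers

# The LSTM layers in the question only set `units` and `dropout`
question_lstm = {"units": 96, "dropout": 0.3}
print(cudnn_blockers(question_lstm))

# A layer with recurrent dropout would not qualify
print(cudnn_blockers({"units": 96, "recurrent_dropout": 0.2}))
```

With a real layer, the dict returned by its `get_config()` method can be passed in directly. For the model in the question no blocker shows up, which is consistent with the padding direction being the deciding factor.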