Why does post-padding train faster than pre-padding?

I've been working on some NLP classification tasks, and I've found that my model trains faster if I use post-padding instead of pre-padding. I'd like to know why that is.

I'm using Google Colab with a GPU runtime to train these models. Here is my preprocessing code:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

PADDING = 'post'

# Tokenising the input strings and padding

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)

# Encoding output one

y1 = y1.to_numpy().reshape(-1, 1)   # Reshape to an array of features
encoder_1 = OneHotEncoder()         # Instantiate encoder
y1 = encoder_1.fit_transform(y1)    # Fit encoder to output 
y1 = y1.toarray()                   # Make output a numpy array

# Encoding output two
    
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
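For reference, this is what the two padding modes produce on a short toy example (standalone, not part of my pipeline):

from tensorflow.keras.preprocessing.sequence import pad_sequences

toy = [[5, 3, 8], [2, 7]]                              # two tokenised sequences of different length
print(pad_sequences(toy, maxlen=5, padding='post'))    # [[5 3 8 0 0] [2 7 0 0 0]] - zeros appended
print(pad_sequences(toy, maxlen=5, padding='pre'))     # [[0 0 5 3 8] [0 0 0 2 7]] - zeros prepended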

Now, creating my model:

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.losses import CategoricalCrossentropy

# --- MODEL PARAMETERS ---

vocab_size = len(tokenizer.index_word) + 1
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])

embedding_size = 175
units = 96

# --- MODEL ARCHITECTURE ---

inputs = Input(shape=(None,))
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)  # mask_zero masks padding index 0 for downstream layers

shared_lstm = Bidirectional(LSTM(units, return_sequences=True, 
                                 dropout=0.3))(input_embeddings)

y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)

y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)

split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])

This is then compiled with:

split_shared_model.compile(
    optimizer='adam', 
    loss=CategoricalCrossentropy(), 
    metrics=['accuracy']
    )

The model summary is as follows:

__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_4 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_3 (Embedding)        (None, None, 175)    19075       ['input_4[0][0]']                
                                                                                                  
 bidirectional_8 (Bidirectional  (None, None, 192)   208896      ['embedding_3[0][0]']            
 )                                                                                                
                                                                                                  
 bidirectional_9 (Bidirectional  (None, 192)         221952      ['bidirectional_8[0][0]']        
 )                                                                                                
                                                                                                  
 bidirectional_10 (Bidirectiona  (None, 192)         221952      ['bidirectional_8[0][0]']        
 l)                                                                                               
                                                                                                  
 y1 (Dense)                     (None, 912)          176016      ['bidirectional_9[0][0]']        
                                                                                                  
 y2 (Dense)                     (None, 617)          119081      ['bidirectional_10[0][0]']       
                                                                                                  
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________

Training starts after calling the fit() method.
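The call is along these lines (the batch size shown here is only illustrative; the targets are keyed by the names of the two Dense output layers):

split_shared_model.fit(
    X_train, {'y1': y1, 'y2': y2},   # targets keyed by the Dense layer names
    epochs=50,
    batch_size=32                    # illustrative value
    )

Below is an intermediate result using the settings above: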

Epoch 1/50
 398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204

However, if I change PADDING to 'pre', I find that training is much slower!

Epoch 1/50
  90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788

Can anyone explain why this is? I thought it might have something to do with the Embedding layer and its masking, but I'm not sure.

This has to do with the underlying LSTM implementation. There are actually two: a "native TensorFlow" one and a highly optimized, pure-CUDA implementation that is much faster. The latter, however, can only be used under specific conditions (certain parameter settings, etc.). You can find the details in the docs. The key point here is:

Inputs, if use masking, are strictly right-padded.

This means that the pre-padding version does not use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here other than sticking with post-padding.
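As a rough illustration (see the docs for the authoritative list), the conditions amount to keeping the LSTM arguments at their defaults; the remaining requirement when masking is used is that the inputs are right-padded:

from tensorflow.keras.layers import LSTM

# Settings that keep the layer eligible for the fast cuDNN kernel
# (these are the defaults); with mask_zero=True in the Embedding layer,
# the inputs must additionally be right-padded, i.e. padding='post'.
fast_lstm = LSTM(
    96,
    activation='tanh',               # must be tanh
    recurrent_activation='sigmoid',  # must be sigmoid
    recurrent_dropout=0.0,           # must be 0
    unroll=False,                    # must not be unrolled
    use_bias=True                    # must use a bias
    )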

Note that TensorFlow sometimes actually emits a warning message stating that it has to use the inefficient implementation. For me, however, this was inconsistent. Keep an eye out for any extra warning output in the pre-padding case.
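If you want to make sure such messages are not being filtered out, one option (a sketch using the standard TensorFlow logger) is to raise the logger verbosity before building the model:

import tensorflow as tf

# Ensure messages at INFO level and above (including any fallback warning)
# are not filtered from the log output.
tf.get_logger().setLevel('INFO')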