Why does post-padding train faster than pre-padding?

I've been working on some NLP classification tasks, and I've found that my model trains faster if I use post-padding instead of pre-padding. I'd like to know why that is.

I'm using Google Colab with a GPU runtime to train these models. Here is my preprocessing code:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

PADDING = 'post'

# Tokenising the input strings and padding

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)

# Encoding output one

y1 = y1.to_numpy().reshape(-1, 1)   # Reshape to an array of features
encoder_1 = OneHotEncoder()         # Instantiate encoder
y1 = encoder_1.fit_transform(y1)    # Fit encoder to output 
y1 = y1.toarray()                   # Make output a numpy array

# Encoding output two
    
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
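For reference, this is what the two padding modes produce on a short toy example (standalone, not part of my pipeline):

from tensorflow.keras.preprocessing.sequence import pad_sequences

toy = [[5, 3, 8], [2, 7]]                              # two tokenised sequences of different length
print(pad_sequences(toy, maxlen=5, padding='post'))    # [[5 3 8 0 0] [2 7 0 0 0]] - zeros appended
print(pad_sequences(toy, maxlen=5, padding='pre'))     # [[0 0 5 3 8] [0 0 0 2 7]] - zeros prepended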

Now, creating my model:

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.losses import CategoricalCrossentropy

# --- MODEL PARAMETERS ---

vocab_size = len(tokenizer.index_word) + 1
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])

embedding_size = 175
units = 96

# --- MODEL ARCHITECTURE ---

inputs = Input(shape=(None,))
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)  # mask_zero masks padding index 0 for downstream layers

shared_lstm = Bidirectional(LSTM(units, return_sequences=True, 
                                 dropout=0.3))(input_embeddings)

y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)

y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)

split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])

This is then compiled with:

split_shared_model.compile(
    optimizer='adam', 
    loss=CategoricalCrossentropy(), 
    metrics=['accuracy']
    )

The model summary is as follows:

__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_4 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_3 (Embedding)        (None, None, 175)    19075       ['input_4[0][0]']                
                                                                                                  
 bidirectional_8 (Bidirectional  (None, None, 192)   208896      ['embedding_3[0][0]']            
 )                                                                                                
                                                                                                  
 bidirectional_9 (Bidirectional  (None, 192)         221952      ['bidirectional_8[0][0]']        
 )                                                                                                
                                                                                                  
 bidirectional_10 (Bidirectiona  (None, 192)         221952      ['bidirectional_8[0][0]']        
 l)                                                                                               
                                                                                                  
 y1 (Dense)                     (None, 912)          176016      ['bidirectional_9[0][0]']        
                                                                                                  
 y2 (Dense)                     (None, 617)          119081      ['bidirectional_10[0][0]']       
                                                                                                  
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________

Training starts after calling the fit() method.
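The call is along these lines (the batch size shown here is only illustrative; the targets are keyed by the names of the two Dense output layers):

split_shared_model.fit(
    X_train, {'y1': y1, 'y2': y2},   # targets keyed by the Dense layer names
    epochs=50,
    batch_size=32                    # illustrative value
    )

Below is an intermediate result using the settings above: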

Epoch 1/50
 398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204

However, if I change PADDING to 'pre', I find that training is much slower!

Epoch 1/50
  90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788

Can anyone explain why this is? I thought it might have something to do with the Embedding layer and its masking, but I'm not sure.

This has to do with the underlying LSTM implementation. There are actually two: a "native TensorFlow" one and a highly optimized, pure-CUDA implementation that is much faster. The latter, however, can only be used under specific conditions (certain parameter settings, etc.). You can find the details in the docs. The key point here is:

Inputs, if use masking, are strictly right-padded.

This means that the pre-padding version does not use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here other than sticking with post-padding.
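As a rough illustration (see the docs for the authoritative list), the conditions amount to keeping the LSTM arguments at their defaults; the remaining requirement when masking is used is that the inputs are right-padded:

from tensorflow.keras.layers import LSTM

# Settings that keep the layer eligible for the fast cuDNN kernel
# (these are the defaults); with mask_zero=True in the Embedding layer,
# the inputs must additionally be right-padded, i.e. padding='post'.
fast_lstm = LSTM(
    96,
    activation='tanh',               # must be tanh
    recurrent_activation='sigmoid',  # must be sigmoid
    recurrent_dropout=0.0,           # must be 0
    unroll=False,                    # must not be unrolled
    use_bias=True                    # must use a bias
    )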

Note that TensorFlow sometimes actually emits a warning message stating that it has to use the inefficient implementation. For me, however, this was inconsistent. Keep an eye out for any extra warning output in the pre-padding case.
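If you want to make sure such messages are not being filtered out, one option (a sketch using the standard TensorFlow logger) is to raise the logger verbosity before building the model:

import tensorflow as tf

# Ensure messages at INFO level and above (including any fallback warning)
# are not filtered from the log output.
tf.get_logger().setLevel('INFO')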