LSTM network on pre-trained gensim word embeddings

Question

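I am new to deep learning. I am trying to build a very basic LSTM network on top of word embedding features. I have written the following code for the model, but I am unable to run it.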

from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Model


max_sequence_size = 14
classes_num = 2

LSTM_word_1 = LSTM(100, activation='relu',recurrent_dropout = 0.25, dropout = 0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)


merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))

predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)

my_model = Model(inputs=[lstm_word_input_1], outputs=predictions)
my_model.summary()

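The error I get is: ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). While searching, I found that people use Flatten() to flatten the 2-D features (3019, 300) for the dense layer. But I have not been able to fix the problem.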

While explaining, please also tell me how the dimensions are calculated.

As requested:

The dimension problem is with my X_training, so I am providing the code below to clear up any confusion.

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model.wv[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec

I believe the code below returns a 2-D NumPy array, since that is how I initialize it.

def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")

    for review in reviews:

        # Print a status message every 1000th review
        if counter % 1000. == 0.:
            print("Question %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model,
                                                         num_features)

        counter = counter + 1.
    return reviewFeatureVecs


def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True ))
    return clean_reviews

My objective is simply to use an LSTM on top of the gensim pre-trained model for some of my reviews.

trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
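
To make the shape mismatch concrete, here is a minimal NumPy-only sketch. It is not part of the original code: `toy_vectors` is a hypothetical dict standing in for the gensim model, and `to_sequence_tensor` is an illustrative helper that keeps one vector per word (zero-padded to `max_sequence_size`) instead of averaging. Averaging produces one row per review, whereas a layer declared as `Input(shape=(max_sequence_size, 300))` expects one row per word per review.

```python
import numpy as np

max_sequence_size = 14
num_features = 300

# Hypothetical stand-in for the gensim model: word -> 300-d vector.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.standard_normal(num_features).astype("float32")
               for w in ["how", "do", "lstm", "layers", "work"]}

def to_sequence_tensor(reviews, vectors, max_len, dim):
    """Keep one vector per word; pad or truncate each review to max_len."""
    out = np.zeros((len(reviews), max_len, dim), dtype="float32")
    for i, review in enumerate(reviews):
        known = [vectors[w] for w in review if w in vectors][:max_len]
        if known:
            out[i, :len(known)] = np.stack(known)
    return out

reviews = [["how", "do", "lstm", "layers", "work"], ["lstm", "unknown"]]

# Averaging (as getAvgFeatureVecs does) collapses the word axis: 2-D result.
averaged = np.stack([
    np.mean([toy_vectors[w] for w in r if w in toy_vectors], axis=0)
    for r in reviews
])
print(averaged.shape)   # (2, 300)

# Keeping the word axis gives the 3-D layout (reviews, words, features).
sequences = to_sequence_tensor(reviews, toy_vectors,
                               max_sequence_size, num_features)
print(sequences.shape)  # (2, 14, 300)
```

With 3019 reviews, the first layout is the (3019, 300) array named in the error message, and the second would be (3019, 14, 300).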

Answer 1

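You should try using an Embedding layer before the LSTM layer. Also, since you already have pre-trained 300-dimensional vectors for your 3019 reviews, you can initialize the embedding layer's weights with this matrix.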

from keras.layers import Input, Embedding, LSTM

inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)

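Here, maxlen is the maximum length of your reviews, max_features is the maximum number of unique words in your dataset (the vocabulary size), and embed_size is the dimensionality of your vectors, which is 300 in your case.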
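
On how the dimensions work out: an embedding layer is essentially a row lookup into a (max_features, embed_size) matrix. The sketch below mimics that lookup with plain NumPy indexing rather than Keras, and all sizes are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the question's data).
max_features = 1000   # vocabulary size
embed_size = 300      # vector dimension
maxlen = 14           # padded review length
n_reviews = 8

embedding_matrix = np.zeros((max_features, embed_size), dtype="float32")

# Each review is a row of integer word indices, padded to maxlen.
rng = np.random.default_rng(0)
word_indices = rng.integers(0, max_features, size=(n_reviews, maxlen))

# Row lookup: (n_reviews, maxlen) -> (n_reviews, maxlen, embed_size).
embedded = embedding_matrix[word_indices]
print(word_indices.shape)  # (8, 14)
print(embedded.shape)      # (8, 14, 300)
```

The 3-D result is the (samples, timesteps, features) layout an LSTM layer consumes. With its default settings the LSTM returns only the last timestep's output, so `LSTM(50)` would yield shape (n_reviews, 50).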

Note that trainDataVecs should have shape (max_features, embed_size), so this should work if you have loaded the pre-trained word vectors into trainDataVecs.
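
One way such a matrix could be assembled is sketched below. This is not the answerer's code: `word_vectors` is a hypothetical dict standing in for the gensim model, and `build_vocab`, `build_embedding_matrix` and `encode` are illustrative helpers. The sketch assumes index 0 is reserved for padding and that words without a pre-trained vector keep an all-zero row.

```python
import numpy as np

def build_vocab(reviews):
    """Map each distinct word to an integer index; 0 is reserved for padding."""
    vocab = {}
    for review in reviews:
        for word in review:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def build_embedding_matrix(vocab, word_vectors, embed_size):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(vocab) + 1, embed_size), dtype="float32")
    for word, idx in vocab.items():
        if word in word_vectors:
            matrix[idx] = word_vectors[word]
    return matrix

def encode(reviews, vocab, maxlen):
    """Turn tokenized reviews into an (n_reviews, maxlen) array of indices."""
    out = np.zeros((len(reviews), maxlen), dtype="int64")
    for i, review in enumerate(reviews):
        ids = [vocab[w] for w in review][:maxlen]
        out[i, :len(ids)] = ids
    return out

# Toy data; word_vectors is a hypothetical stand-in for a gensim model.
embed_size = 4
reviews = [["good", "movie"], ["bad", "movie", "plot"]]
word_vectors = {"good": np.ones(embed_size), "movie": np.full(embed_size, 2.0)}

vocab = build_vocab(reviews)
embedding_matrix = build_embedding_matrix(vocab, word_vectors, embed_size)
X = encode(reviews, vocab, maxlen=5)

print(embedding_matrix.shape)  # (5, 4), i.e. max_features = len(vocab) + 1
print(X.shape)                 # (2, 5)
```

Under these assumptions, `X` would be the array fed to the model, `embedding_matrix` would be passed as `weights=[embedding_matrix]`, and `max_features` would equal `embedding_matrix.shape[0]`.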

