Comparing LSTM structure
I'm trying to build an LSTM model based on this picture. I'm a beginner in deep learning, especially RNN architectures, so I need your advice to guide me.
I'm working with a dataframe of 70k users and 12k anime. My dataframe contains:
user id
user rating
anime id
genre: the list of tags associated with an anime, e.g. action, comedy, school, etc.
users_tags: a list of 15 unique tags per user that I built with a TF-IDF approach over some user-related text data (see the sketch right after this list)
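For context, the per-user tags were built roughly like this. This is only a minimal sketch: user_texts_df and its user_text column are placeholder names for wherever the user-related text actually lives.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(user_texts_df["user_text"])   # hypothetical dataframe/column
vocab = vectorizer.get_feature_names_out()                     # get_feature_names() on older scikit-learn

# keep the 15 highest-scoring terms per user as that user's tags
user_tags = []
for row in tfidf:                                              # iterates one sparse row per user
    scores = row.toarray().ravel()
    top15 = scores.argsort()[-15:][::-1]
    user_tags.append([vocab[i] for i in top15])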
My dataframe looks like:
anime_id user_id user_rating name tags genre
0 1 234 9.0 Cowboy Bebop drama , fi , mal action , military , sci fi , ... Action, Adventure, Comedy, Drama, Sci-Fi, Space
1 1 382 10.0 Cowboy Bebop life , shiki , tv , thriller , movie short , c... Action, Adventure, Comedy, Drama, Sci-Fi, Space
2 1 160 9.0 Cowboy Bebop fantasy , action , supernatural , tv , mystery... Action, Adventure, Comedy, Drama, Sci-Fi, Space
3 1 341 8.0 Cowboy Bebop action , school , romance , new , short , mal ... Action, Adventure, Comedy, Drama, Sci-Fi, Space
4 1 490 9.0 Cowboy Bebop mal adventure , movie short , school , strange... Action, Adventure, Comedy, Drama, Sci-Fi,
Here are the parameters I use for the model:
# parameters
users = interactions_full_df.user_id.unique()
animes = interactions_full_df.anime_id.unique()
# join the unique genre strings on "," (joining on " " would fuse the last tag of one
# string with the first tag of the next), then strip and deduplicate the tags
animes_tags = sorted({tag.strip() for tag in ",".join(interactions_full_df["genre"].unique()).split(",")})
n_animes_tags = len(animes_tags)
n_users = len(users)
n_animes = len(animes)
n_users_tags = 15
I set latent_dim = 100 for my embedding layers.
Here is my attempt at building the model. Can you tell me whether I'm on the right track?
""" The lstm cell is the concatenation of 3 things :
--> 1.0 Anime Embedding Vector
--> 2.0 Average of :
--> 2.1 Tags embedding vectors associated with the current anime
--> 2.2 Tags embedding vectors associated with the next anime in a sequence
"""
# 1.0
animes_input = Input(shape=[1],name='Anime')
animes_embedding = Embedding(n_animes + 1,
latent_dim,
name='Animes-Embedding')(animes_input)
""" I suppose we need Users embedding to find what's anime chosen by users ??"""
Users_input = Input(shape=[1],name='Users')
Users_embedding = Embedding(n_users + 1,
latent_dim,
name='Users-Embeddings')(Users_input)
#2.0
# 2.1
""" Anime Tags """
animes_tags_input = Input(shape=[1],name='anime_tags')
tags_embedding = Embedding(n_animes_tags + 1,
latent_dim,
name='Animes-Tags-embedding')(animes_tags_input)
#2.2 : tags of future anime in a sequence ???
#my input will be a padded sequence of tags used as a string object <<<<<----
inp_shape = max_sequence_len - 1
input_len = Input(shape=[inp_shape], name = "future_tags")
sequence_tags_embeddings = Embedding(tags_total_words, latent_dim)(input_len)
sequence_lstm_cells = LSTM(30)(sequence_tags_embeddings)
future_tags_embedding = Dense(latent_dim, activation='softmax')(sequence_lstm_cells) #???????????? i'm not sure at all
# then average them
averaged_tags = average([tags_embedding, future_tags_embedding])
#then we need to concatenate all of them
merged_cell = merge([averaged_tags, animes_embedding, Users_embedding])
# My lstm cells is ready : the structure seems to be an Many to One (may be i'm wrong ?)
n_neurons = 100
lstm_cell = LSTM(30, input_shape=(10, 1))(merged_cell)
result = Dense(1, activation='softmax', name = "Recommendation")(lstm_cell)
LSTM_MODEL = Model([animes_input, animes_tags_input, Users_input, input_len], result)
LSTM_MODEL.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
LSTM_MODEL.summary()
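Before training, I sanity-check the wiring with a dummy batch. This is just a quick sketch assuming the model above built and the counts defined earlier:

import numpy as np

batch = 4
dummy_inputs = [
    np.random.randint(0, n_animes, size=(batch, 1)),                            # anime ids
    np.random.randint(0, n_animes_tags, size=(batch, 1)),                       # anime tag ids
    np.random.randint(0, n_users, size=(batch, 1)),                             # user ids
    np.random.randint(0, tags_total_words, size=(batch, max_sequence_len - 1))  # padded future-tag tokens
]
print(LSTM_MODEL.predict(dummy_inputs).shape)  # expect (4, 1)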
对于 "future tags" 的部分,我使用了像这样的填充标签序列:
import numpy as np
from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras.utils as ku

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    ## convert each text into its n-gram prefix sequences of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

def generate_padded_sequences(input_sequences, input_total_words):
    max_sequence_len = max([len(x) for x in tqdm(input_sequences)])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    # the last token of each padded sequence is the label, the rest are the predictors
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    label = ku.to_categorical(label, num_classes=input_total_words)
    return predictors, label, max_sequence_len
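To make the n-gram expansion concrete, here is a toy run (the exact token ids depend on the fit; note it re-fits the shared tokenizer, so run it before the real corpora):

seqs, total = get_sequence_of_tokens(["action comedy school"])
print(seqs)   # [[1, 2], [1, 2, 3]] : the prefixes of the tokenized tag string
print(total)  # 4 = 3 distinct words + 1 reserved for the padding index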
print("create list ..")
train_tags_anime_list = [get_tags_anime(anime_id) for anime_id in tqdm(train["anime_id"])]
test_tags_anime_list = [get_tags_anime(anime_id) for anime_id in tqdm(valid["anime_id"])]
print("cleaning ...")
train_tags_corpus = [clean_text(x) for x in tqdm(train_tags_anime_list)]
valid_tags_corpus = [clean_text(x) for x in tqdm(test_tags_anime_list)]
print("tokenization ..")
train_tags_inp_sequences, train_tags_total_words = get_sequence_of_tokens(train_tags_corpus)
valid_tags_inp_sequences, valid_tags_total_words = get_sequence_of_tokens(valid_tags_corpus)
print("padd sequence")
train_tags_predictors, train_tags_label, train_max_sequence_len = generate_padded_sequences(train_tags_inp_sequences, train_tags_total_words)
valid_tags_predictors, valid_tags_label, valid_max_sequence_len = generate_padded_sequences(valid_tags_inp_sequences, valid_tags_total_words)
You want to build a stacked LSTM network with multiple features (what you call parameters are usually called features). This is covered in https://machinelearningmastery.com/stacked-long-short-term-memory-networks/, https://machinelearningmastery.com/use-features-lstm-networks-time-series-forecasting/ and https://datascience.stackexchange.com/questions/17024/rnns-with-multiple-features.
RNNs such as LSTMs can only process sequential data, but this works with multi-dimensional feature vectors (your collection of parameters), as answered in https://datascience.stackexchange.com/questions/17024/rnns-with-multiple-features.
The structure shown, 2 layers of 3 LSTM cells each (6 cells in total), is a stacked LSTM network with 2 layers, feature_dim = data_dim = 6 (or 7) (the number of parameters/features), and timesteps = 3 (each of the 2 layers has 3 cells); cf. the section "Stacked LSTM for sequence classification" in https://keras.io/getting-started/sequential-model-guide/ for the Keras code, sketched below.
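A minimal version after that guide section; the layer width and class count here are placeholders, and data_dim/timesteps match the figure:

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

data_dim = 6      # number of features/parameters per timestep
timesteps = 3
num_classes = 10  # placeholder

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(timesteps, data_dim)))  # layer 1 emits a sequence
model.add(LSTM(32))                                                            # layer 2 consumes it
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# training inputs must be shaped (num_samples, timesteps, data_dim)
x = np.random.random((100, timesteps, data_dim))
y = np.random.random((100, num_classes))
model.fit(x, y, epochs=1, batch_size=16)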
Setting the exact input shape is crucial; your network is a many-to-many case. The input passed to an LSTM should have the form (num_samples, timesteps, data_dim), where data_dim is the size of the feature or parameter vector.
Embedding layers are used for one-hot-style encodings, cf. https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12 and https://keras.io/layers/embeddings/ for the Keras code. Possibly you could also use simple label encoding (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html, http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), as sketched below.
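For instance, a quick comparison of the two scikit-learn encoders on some made-up genre tags:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

tags = ["Action", "Comedy", "Drama", "Comedy"]

# label encoding: one integer per category
le = LabelEncoder()
print(le.fit_transform(tags))                      # e.g. [0 1 2 1]

# one-hot encoding: one binary column per category
ohe = OneHotEncoder()
print(ohe.fit_transform([[t] for t in tags]).toarray())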