PyTorch - WSD using LSTM
I am trying to replicate Google's research paper on WSD with a neural model using PyTorch.
Before training on a large dataset, I ran into some problems while trying to overfit the model.
Using this training set:
The film was also intended to be the first in a trilogy.
With this model definition:
import torch
import torch.nn as nn
from torch import autograd

class WordGuesser(nn.Module):
    def __init__(self, hidden_dim, context_dim, embedding_dim, vocabulary_dim, batch_dim, window_dim):
        super(WordGuesser, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_dim = batch_dim
        self.window_dim = window_dim
        self.word_embeddings = nn.Embedding(vocabulary_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        #self.extract_context = nn.Linear((2 * window_dim + 1) * hidden_dim, context_dim)
        self.extract_context = nn.Linear(hidden_dim, context_dim)
        self.predict = nn.Linear(context_dim, vocabulary_dim)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h_0, c_0) for a single-layer LSTM, shaped (num_layers, batch, hidden_dim)
        return (autograd.Variable(torch.zeros(1, self.batch_dim, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, self.batch_dim, self.hidden_dim).cuda()))

    def forward(self, sentence, hidden):
        embeddings = self.word_embeddings(sentence)
        # permute to (seq_len, batch, embedding_dim), the layout expected by nn.LSTM
        out, self.hidden = self.lstm(embeddings.permute(1, 0, 2), hidden)
        lstm_out = out[-1]  # hidden state at the last time step
        context = self.extract_context(lstm_out)
        prediction = self.predict(context)
        return prediction, context
And this training procedure:
import numpy as np
import torch.optim as optim

num_epoch = 100
hidden_units = 512
embedding_dim = 256
context_dim = 256

def mytrain():
    lines = open('training/overfit.txt').readlines()
    sentences = data.split_to_sentences(lines)  # uses spaCy to detect sentences in each line
    word2idx = dict()  # dictionary built from the training set
    idx2word = dict()
    i = 0
    for s in sentences:
        for t in s.split(' '):
            if t in word2idx:
                continue
            word2idx[t] = i
            idx2word[i] = t
            i += 1
    word2idx['$'] = i  # the token marking the missing word to guess in a sentence
    idx2word[i] = '$'

    X = list()
    Y = list()
    for sentence in sentences:
        sentence = sentence.split(' ')
        for i in range(len(sentence)):
            newsentence = list(sentence)
            newsentence[i] = '$'
            if not sentence[i] in word2idx:
                continue
            indices = [word2idx[w] for w in newsentence]
            label = word2idx[sentence[i]]
            X.append(indices)
            Y.append(label)

    model = WordGuesser(hidden_units, context_dim, embedding_dim, len(word2idx), len(X), len(X[0]))
    model.train()
    model.cuda()

    input = torch.LongTensor(X).cuda()
    output = torch.LongTensor(Y).cuda()

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)

    model.hidden = model.init_hidden()
    for epoch in range(num_epoch):
        model.hidden = model.init_hidden()
        model.zero_grad()

        input_tensor = autograd.Variable(input)
        target_tensor = autograd.Variable(output)

        predictions, context = model(input_tensor, model.hidden)
        for i, prediction in enumerate(predictions):
            # print the top-5 predicted words and the true label for this position
            sorted_val = sorted(enumerate(np.array(prediction.data)), key=lambda x: x[1], reverse=True)
            print([(idx2word[x[0]], x[1]) for x in sorted_val[:5]], idx2word[Y[i]])

        loss = criterion(predictions, target_tensor)
        loss.backward()
        optimizer.step()
        print(epoch, loss.data[0])

    torch.save(model, "train2.pt")
During training the model does seem to overfit after the 21st epoch, as you can see from the following scores (each line shows the top-5 predicted words; the last word on the line is the label for that position in the sentence):
[('The', 11.362326), ('film', 11.356865), ('also', 7.5573149), ('to', 5.3518314), ('intended', 4.3520432)] The
[('film', 11.073805), ('The', 10.451499), ('also', 7.5498624), ('was', 4.9684553), ('be', 4.0730805)] film
[('was', 11.232123), ('also', 9.9741745), ('the', 6.0156212), ('be', 4.9949703), ('The', 4.5516477)] was
[('also', 9.6998224), ('was', 9.6202812), ('The', 6.345758), ('film', 4.9122157), ('be', 2.6727715)] also
[('intended', 18.344809), ('to', 16.410078), ('film', 10.147289), ('The', 9.8423424), ('$', 9.6181822)] intended
[('to', 12.442947), ('intended', 10.900065), ('film', 8.2598763), ('The', 8.0493736), ('$', 4.4901967)] to
[('be', 12.189278), ('also', 7.7172523), ('was', 7.5415096), ('the', 5.2521734), ('The', 4.1723843)] be
[('the', 15.59604), ('be', 9.3750105), ('first', 8.9820032), ('was', 8.6859236), ('also', 5.0665498)] the
[('first', 10.191225), ('the', 5.1829329), ('in', 3.6020348), ('be', 3.4108081), ('a', 1.5569853)] first
[('in', 14.731103), ('first', 9.3131113), ('a', 5.982264), ('trilogy', 4.2928643), ('be', 0.49548936)] in
[('a', 14.357709), ('in', 8.3088198), ('trilogy', 6.3918238), ('first', 6.2178354), ('intended', 0.95656234)] a
[('trilogy', 14.351434), ('a', 4.5073452), ('in', 4.2348137), ('$', 3.7552347), ('intended', 3.5101018)] trilogy
[('.', 18.152126), ('$', 12.028764), ('to', 9.6003456), ('intended', 8.1202478), ('The', 4.9225812)] .
However, when I run another Python script that loads the model and queries it with the following sentences (using the same code used to print the scores during training):
The film was also intended to $ the first in a trilogy. be
The film $ also intended to be the first in a trilogy. was
$ film was also intended to be the first in a trilogy. The
I get these scores:
[('film', 24.066889), ('$', 20.107487), ('was', 16.855488), ('a', 12.969441), ('in', 8.1248817)] be
[('film', 24.089062), ('$', 20.116539), ('was', 16.891994), ('a', 12.982826), ('in', 8.1167336)] was
[('film', 23.993624), ('$', 20.108011), ('was', 16.891005), ('a', 12.960193), ('in', 8.1577587)] The
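Roughly, such an evaluation script looks like this (a minimal sketch rather than the exact code; it assumes word2idx/idx2word are rebuilt the same way as in mytrain() and that the query tokens match the training tokenization):

import numpy as np
import torch
from torch import autograd

# assumption: word2idx/idx2word are rebuilt exactly as in mytrain()
model = torch.load("train2.pt")   # loads the whole pickled model saved during training
model.cuda()

queries = [
    "The film was also intended to $ the first in a trilogy .",
    "The film $ also intended to be the first in a trilogy .",
    "$ film was also intended to be the first in a trilogy .",
]
# tokens must match those produced during training (note '.' as its own token)
X = [[word2idx[w] for w in q.split(' ')] for q in queries]

input_tensor = autograd.Variable(torch.LongTensor(X).cuda())
# hidden state sized for this batch of 3 queries, since init_hidden() is tied
# to the training batch size
hidden = (autograd.Variable(torch.zeros(1, len(X), model.hidden_dim).cuda()),
          autograd.Variable(torch.zeros(1, len(X), model.hidden_dim).cuda()))

predictions, context = model(input_tensor, hidden)
for prediction in predictions:
    sorted_val = sorted(enumerate(np.array(prediction.data)), key=lambda x: x[1], reverse=True)
    print([(idx2word[x[0]], x[1]) for x in sorted_val[:5]])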
I have also tried setting model.train() mode to False, using model.eval(), and calling topk on the LSTM scores, but the results weren't satisfactory.

Saving only the model's state_dict() via torch.save(), and then loading it back in the evaluation phase with model.load_state_dict(), solved the problem.
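In code, the change amounts to something like this (a short sketch; the filename and the batch_dim/window_dim values passed in the evaluation script are placeholders):

# in mytrain(): save only the learned parameters instead of the whole pickled model
torch.save(model.state_dict(), "train2_state.pt")   # example filename

# in the evaluation script: rebuild the model with the same hyperparameters,
# then load the weights back in
model = WordGuesser(hidden_units, context_dim, embedding_dim, len(word2idx),
                    batch_dim, window_dim)           # same constructor arguments as in training
model.load_state_dict(torch.load("train2_state.pt"))
model.cuda()
model.eval()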
In addition, I wrapped the sentence-query loop in another loop that acts as a warm-up (taken from here), and on its last iteration I set model.eval() and print the scores, which then turn out to be correct.
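Roughly, that warm-up looks like this (a sketch that continues the evaluation snippet above; the number of passes is arbitrary):

# run the same queries a few times and only print the scores from the final
# pass, with the model switched to eval mode
num_warmup = 5  # arbitrary number of warm-up passes
for step in range(num_warmup):
    if step == num_warmup - 1:
        model.eval()  # switch to evaluation mode on the last pass
    predictions, context = model(input_tensor, hidden)
    if step == num_warmup - 1:
        for prediction in predictions:
            sorted_val = sorted(enumerate(np.array(prediction.data)), key=lambda x: x[1], reverse=True)
            print([(idx2word[x[0]], x[1]) for x in sorted_val[:5]])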