RuntimeError: cudnn RNN backward can only be called in training mode
This is the first time I have run into this problem; I have never seen an error like it in any of my previous Python projects. Here is my training code:
def train(net, opt, criterion, ucf_train, batchsize, i):
    opt.zero_grad()
    total_loss = 0
    net = net.eval()
    net = net.train()
    for vid in range(i*batchsize, i*batchsize + batchsize, 1):
        output = infer(net, ucf_train[vid])
        m = get_label_no(ucf_train[vid])
        m = m.cuda()
        loss = criterion(output, m)
        loss.backward(retain_graph=True)
        total_loss += loss
    opt.step()  # updates wghts and biases
    return total_loss/n_points
Inference code (net, input):
def infer(net, name):
    net.eval()
    hidden_0 = net.init_hidden()
    hidden_1 = net.init_hidden()
    hidden_2 = net.init_hidden()
    video_path = fetch_ucf_video(name)
    cap = cv2.VideoCapture(video_path)
    resize = (224, 224)
    T = FrameCapture(video_path)
    print(T)
    lim = T - (T % 20) - 2
    i = 0
    while(1):
        ret, frame2 = cap.read()
        frame2 = cv2.resize(frame2, resize)
        # print(type(frame2))
        if (i % 20 == 0 and i < lim):
            input = normalize(frame2)
            input = input.cuda()
            output, hidden_0, hidden_1, hidden_2 = net(input, hidden_0, hidden_1, hidden_2)
        elif (i >= lim):
            break
        i = i + 1
    op = output
    torch.cuda.empty_cache()
    op = op.cuda()
    return op
I am getting this error. I tried using model.train() after this, where net is my model:
RuntimeError Traceback (most recent call last)
<ipython-input-62-42238f3f6877> in <module>()
----> 1 train(net1,opt,criterion,ucf_train,1,0)
2 frames
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
125 Variable._execution_engine.run_backward(
126 tensors, grad_tensors, retain_graph, create_graph,
--> 127 allow_unreachable=True) # allow_unreachable flag
128
129
RuntimeError: cudnn RNN backward can only be called in training mode
You should remove the net.eval() call that comes right after def infer(net, name):. It has to go because you call this inference function from your training code, and the model needs to stay in training mode throughout training. You also never switch the model back to training mode after calling eval, which is the source of the exception you are getting. If you want to reuse this inference code for your test case, you can guard that case with an if.
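For context, the error is a cuDNN limitation rather than anything specific to your data pipeline: a cuDNN-backed RNN run in eval mode does not keep the intermediate state the backward pass needs. A minimal sketch that reproduces it, assuming a CUDA device with cuDNN enabled (the LSTM sizes here are arbitrary):
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=4, hidden_size=8).cuda()
rnn.eval()  # eval mode is the problematic state

out, _ = rnn(torch.randn(5, 1, 4, device="cuda"))
out.sum().backward()  # raises: cudnn RNN backward can only be called in training mode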
Also, the net.eval() right after the total_loss = 0 assignment serves no purpose, because you call net.train() immediately afterwards. You can delete that one too, since it is cancelled out on the very next line.
Updated code:
def train(net, opt, criterion, ucf_train, batchsize, i):
    opt.zero_grad()
    total_loss = 0
    net = net.train()
    for vid in range(i*batchsize, i*batchsize + batchsize, 1):
        output = infer(net, ucf_train[vid])
        m = get_label_no(ucf_train[vid])
        m = m.cuda()
        loss = criterion(output, m)
        loss.backward(retain_graph=True)
        total_loss += loss
    opt.step()  # updates wghts and biases
    return total_loss/n_points
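For reference, a hypothetical driver loop for this train function could look like the following (num_epochs and n_batches are assumptions of mine, not names from the original code; the traceback above calls it with batchsize=1):
for epoch in range(num_epochs):
    for i in range(n_batches):  # e.g. n_batches = len(ucf_train) // batchsize, assumed
        avg_loss = train(net1, opt, criterion, ucf_train, 1, i)
    print("epoch", epoch, "avg loss", float(avg_loss))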
Inference code (net, input):
def infer(net, name, is_train=True):
    if not is_train:
        net.eval()
    hidden_0 = net.init_hidden()
    hidden_1 = net.init_hidden()
    hidden_2 = net.init_hidden()
    video_path = fetch_ucf_video(name)
    cap = cv2.VideoCapture(video_path)
    resize = (224, 224)
    T = FrameCapture(video_path)
    print(T)
    lim = T - (T % 20) - 2
    i = 0
    while(1):
        ret, frame2 = cap.read()
        frame2 = cv2.resize(frame2, resize)
        # print(type(frame2))
        if (i % 20 == 0 and i < lim):
            input = normalize(frame2)
            input = input.cuda()
            output, hidden_0, hidden_1, hidden_2 = net(input, hidden_0, hidden_1, hidden_2)
        elif (i >= lim):
            break
        i = i + 1
    op = output
    torch.cuda.empty_cache()
    op = op.cuda()
    return op
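At test time the same function can then be reused through the flag; a sketch (the torch.no_grad() wrapper, the ucf_test name, and the switch back to train mode are my additions, not part of the answer above):
with torch.no_grad():  # no gradients needed for evaluation
    pred = infer(net, ucf_test[0], is_train=False)
net.train()  # restore training mode before the next train() call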