Is there a way to send patches of image into a transformer model for inference or combine the patches together to make one image?

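I am running inference on a vision transformer model (DeiT) with a single image of size 224x224. I split the image into 196 patches and manipulate the pixels of one patch to see how the model behaves. Each patch is 16x16.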

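When I feed these patches to the model, I get an error saying the input image size (16*16) does not match (224*224). Of course, the model was trained on 224x224 images and expects the same dimensions. One idea is to combine the patches back into one complete image, but I am having trouble passing it to the model.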

Single image shape: [1, 3, 224, 224]. Patched image shape: [196, 16, 16, 3].

import torch
from models.deit import deit_small_patch16_224
import matplotlib.pyplot as plt  
import numpy as np
from PIL import Image
import os
from torchvision.transforms import transforms as transforms
from torchvision.utils import make_grid

def into_patches(im, xPieces, yPieces):
    imgwidth, imgheight = im.size
    height = imgheight // yPieces
    width = imgwidth // xPieces
    #fig, axs = plt.subplots(yPieces, xPieces)
    img_list = []
    for i in range(0, yPieces):
        for j in range(0, xPieces):
            box = (j * width, i * height, (j + 1) * width, (i + 1) * height)
            a = im.crop(box)
            np_img = np.array(a)  # np.array copies, so the patch is writable
            if i == 6 and j == 5:
                # zero out one patch to probe the model's behavior
                np_img[:] = 0
            img_list.append(np_img)
    return img_list

class_names = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

model = deit_small_patch16_224(pretrained=True, use_top_n_heads=8, use_patch_outputs=False)

checkpoint = torch.load("./checkpoint/deit224.t7")
state_dict = checkpoint["model"]
new_state_dict = {}
for key in state_dict:
    new_key = '.'.join(key.split('.')[1:])
    new_state_dict[new_key] = state_dict[key]

model.head = torch.nn.Linear(in_features=model.head.in_features, out_features=10)
model.load_state_dict(new_state_dict)
model.eval()



img = Image.open("bird.jpeg")
img = img.resize((224, 224), resample=0)
a = np.array(into_patches(img, 14, 14))

img_tensor = torch.tensor(a)

# print(img_tensor.shape)
with torch.no_grad():
    output = model(img_tensor)
    predicted_class = np.argmax(output)
    print(predicted_class.item())

I get the following error: AssertionError: Input image size (16*3) doesn't match model (224*224).

Is it possible to combine these 196 patches back into a 224x224 image?
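
For the [196, 16, 16, 3] array built from into_patches, one possible way to do this is a NumPy reshape and transpose. The following is only a sketch under some assumptions: the patches are in row-major order (as into_patches produces them), a random array stands in for the real image, combine_patches is a hypothetical helper name, and the scaling to [0, 1] is a placeholder for whatever preprocessing the checkpoint was actually trained with.

```python
import numpy as np

def combine_patches(patches, rows, cols):
    """Reassemble row-major patches (rows*cols, ph, pw, C) into (rows*ph, cols*pw, C)."""
    n, ph, pw, c = patches.shape
    assert n == rows * cols, "patch count must equal rows * cols"
    grid = patches.reshape(rows, cols, ph, pw, c)
    # (rows, cols, ph, pw, C) -> (rows, ph, cols, pw, C), then merge the grid axes
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * ph, cols * pw, c)

# Stand-in for the resized 224x224 RGB image (assumption: random data)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Split in the same row-major order as into_patches: shape (196, 16, 16, 3)
patches = np.stack([
    image[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]
    for i in range(14) for j in range(14)
])

restored = combine_patches(patches, 14, 14)  # shape (224, 224, 3)

# Channels-first with a batch dimension: shape (1, 3, 224, 224)
# (assumption: simple [0, 1] scaling; use the model's own preprocessing in practice)
batch = restored.transpose(2, 0, 1)[None].astype(np.float32) / 255.0
```

The resulting batch could then be wrapped with torch.from_numpy(batch) before being passed to the model. Note that the [196, 16, 16, 3] tensor is read by the model as 196 images with 16 channels, height 16 and width 3, which is why the error message reports (16*3).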

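You can use torch.nn.functional.unfold to extract the patches, manipulate them, and then use torch.nn.functional.fold to rearrange them back into an image: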

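import torch.nn.functional as nnf

# x is a batch of images of shape (N, 3, 224, 224)
# divide the batch of images into non-overlapping patches -> (N, 3*16*16, 196)
u = nnf.unfold(x, kernel_size=16, stride=16, padding=0)
# manipulate patch number 17 (patches are indexed along the last dimension)
u[..., 17] = my_manipulation(u[..., 17])
# fold the patches back together -> (N, 3, 224, 224)
f = nnf.fold(u, x.shape[-2:], kernel_size=16, stride=16, padding=0)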
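
To make the round trip concrete, here is a self-contained sketch of the same idea. It assumes a random tensor in place of the real preprocessed image, and it uses "set the patch to zero" as a stand-in manipulation, matching the patch at row 6, column 5 that the question blacks out.

```python
import torch
import torch.nn.functional as nnf

# Stand-in for a preprocessed batch: one 3x224x224 image (assumption: random data)
x = torch.rand(1, 3, 224, 224)

# Unfold into 196 non-overlapping 16x16 patches: shape (1, 3*16*16, 196)
u = nnf.unfold(x, kernel_size=16, stride=16)

# Patches are ordered row by row, so row 6, column 5 is index 6*14 + 5
idx = 6 * 14 + 5
u[..., idx] = 0.0  # stand-in manipulation: black out this patch

# Fold the patches back into a full image: shape (1, 3, 224, 224)
f = nnf.fold(u, output_size=x.shape[-2:], kernel_size=16, stride=16)
```

Because the patches do not overlap, fold reproduces every untouched pixel exactly, and f has the [1, 3, 224, 224] shape the model expects, so it can be passed to model(f) directly.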