
Is there a way to send patches of an image into a transformer model for inference, or to combine the patches back into one image?

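I am running inference on a vision transformer model (DeiT) with a single 224x224 image. I split the image into 196 patches and manipulate the pixels of one patch to see how the model's behavior changes. Each patch is 16x16.

When I feed these patches to the model, I get an error: input image size (16*16) doesn't match model (224*224). Of course, the model was trained on 224x224 images and expects the same size. One idea is to combine the patches back into a single full image, but I am having trouble doing that.

Single image shape: ([1, 3, 224, 224]). Patched image shape: [196, 16, 16, 3].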

import torch
from models.deit import deit_small_patch16_224
import matplotlib.pyplot as plt  
import numpy as np
from PIL import Image
import os
from torchvision.transforms import transforms as transforms
from torchvision.utils import make_grid

def into_patches(im, xPieces, yPieces):
    """Split a PIL image into yPieces x xPieces patches, in row-major order."""
    imgwidth, imgheight = im.size
    height = imgheight // yPieces
    width = imgwidth // xPieces
    img_list = []
    for i in range(yPieces):
        for j in range(xPieces):
            box = (j * width, i * height, (j + 1) * width, (i + 1) * height)
            # np.array copies the crop, so the patch is writable
            np_img = np.array(im.crop(box))
            if i == 6 and j == 5:
                np_img[:] = 0  # black out the patch at row 6, column 5
            img_list.append(np_img)
    return img_list

class_names = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

model = deit_small_patch16_224(pretrained=True, use_top_n_heads=8, use_patch_outputs=False)

checkpoint = torch.load("./checkpoint/deit224.t7")
state_dict = checkpoint["model"]
new_state_dict = {}
for key in state_dict:
    new_key = '.'.join(key.split('.')[1:])
    new_state_dict[new_key] = state_dict[key]

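# replace the classification head with a 10-class one
model.head = torch.nn.Linear(in_features=model.head.in_features, out_features=10)
# new_state_dict was built above but never used; loading it here is the presumed intent
model.load_state_dict(new_state_dict)
model.eval()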

img = Image.open("bird.jpeg")
img = img.resize((224, 224), resample=0)
a = np.array(into_patches(img, 14, 14))

img_tensor = torch.tensor(a)

# print(img_tensor.shape)
with torch.no_grad():
    output = model(img_tensor)  # raises the AssertionError shown below
    predicted_class = output.argmax(dim=-1)

I get the following error: AssertionError: Input image size (16*3) doesn't match model (224*224).

Is it possible to combine these 196 patches back into a 224x224 image?

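You can use torch.nn.functional.unfold to extract the patches, manipulate them, and then re-arrange them back into an image with torch.nn.functional.fold: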

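import torch.nn.functional as nnf

# x: batch of images, shape (B, C, H, W)
# divide the batch of images into non-overlapping patches;
# u has shape (B, C * 16 * 16, number_of_patches)
u = nnf.unfold(x, kernel_size=16, stride=16, padding=0)
# manipulate patch number 17 (my_manipulation is a placeholder for your own function)
u[..., 17] = my_manipulation(u[..., 17])
# fold the patches back together
f = nnf.fold(u, x.shape[-2:], kernel_size=16, stride=16, padding=0)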
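
As a rough, self-contained sketch of the same idea, the snippet below zeroes the patch at row 6, column 5 (the one blacked out in the question) and folds the result back into a 224x224 image. It also shows one possible way to reassemble a [196, 16, 16, 3] NumPy array like the one produced by into_patches. The helper names zero_patch and patches_to_image are made up for illustration, and a random tensor stands in for the real preprocessed image.

```python
import numpy as np
import torch
import torch.nn.functional as nnf

PATCH = 16  # patch size used in the question


def zero_patch(x, row, col, patch=PATCH):
    """Zero one patch of a (B, C, H, W) batch using unfold/fold (hypothetical helper)."""
    cols = x.shape[-1] // patch  # patches per row, 14 for 224 / 16
    u = nnf.unfold(x, kernel_size=patch, stride=patch)  # (B, C*patch*patch, L)
    u[..., row * cols + col] = 0  # patches are enumerated row by row
    return nnf.fold(u, x.shape[-2:], kernel_size=patch, stride=patch)


def patches_to_image(patches, rows, cols):
    """Reassemble (rows*cols, ph, pw, C) patches into a (1, C, H, W) tensor (hypothetical helper)."""
    n, ph, pw, c = patches.shape
    grid = patches.reshape(rows, cols, ph, pw, c)
    # interleave patch rows with pixel rows, then merge into (H, W, C)
    img = grid.transpose(0, 2, 1, 3, 4).reshape(rows * ph, cols * pw, c)
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)


x = torch.rand(1, 3, 224, 224)  # stand-in for the preprocessed image
y = zero_patch(x, row=6, col=5)  # same shape as x, ready for the model
```

With the question's code, something like patches_to_image(a, 14, 14) should give a uint8 tensor of shape [1, 3, 224, 224]. It would presumably still need converting to float and the same normalization the model was trained with before being passed to model(...).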