Why does my Regional Proposal Network (RPN) output 4d array of predictions / scores and how do I interpret output?

Introduction

I want to create a Region Proposal Network (RPN) using VGG16 and the Keras (Python) framework. I'm struggling to understand how to interpret the RPN's output in order to predict bounding boxes for foreground objects.

Why does the RPN produce a 5x5 × (number of anchor boxes) array, and how do I know which element corresponds to which anchor box?

    # Below is some lovely pseudo-code
    array_of_feature_maps = topless_vgg_model.predict(pre_processed_img)
    print(array_of_feature_maps.shape)
    >>> (1,7,7,512)

    all_anchor_boxes = get_potential_boxes_for_region_proposal()
    print(len(all_anchor_boxes))
    >>> 784

    predicted_scores_for_anchor_boxes, predicted_adjustments = rpn_model.predict(array_of_feature_maps)
    # 4 * 784 = 3136
    print(f"Scores Shape = {predicted_scores_for_anchor_boxes.shape}, Adjustments (Deltas) Shape = {predicted_adjustments.shape}")
    >>> Scores Shape = (1,5,5,784), Adjustments (Deltas) Shape = (1,5,5,3136) 

Have I made a mistake when creating the RPN? Can I just pick element [0][0][0] and get the scores/deltas? The main resources I have been following are listed below under "My research so far".

Here is the code itself.

The main function is at the top, just below the config dictionary.

    from keras import Model
    from keras import models
    from keras import optimizers
    from keras import Sequential
    from keras import layers
    from keras import losses
    from keras.preprocessing.image import ImageDataGenerator
    from keras.optimizers import Adam
    import keras.backend as K
    import keras.applications
    from keras import applications
    from keras import utils
    import cv2
    import numpy as np
    import os
    import math
    
    config = {
        "ImgPath" : "1. Data Gen\1. Data\1X9A1712.jpg" #"Put your image path here"
        ,"VGG16InputSize" : (224,224)
        ,"AnchorBox" : {
            "AspectRatioW_div_W" : [1/3,1/2,3/4,1]
            ,"Scales" : [1/2,3/4,1,3/2]
        }
    }

    def main(): ############ MAIN FUNCTION - START HERE ############
        # Get vgg model
        vggmodel = applications.VGG16(include_top=False,weights='imagenet') 
    
        # Extract features for images (used dictionary comprehension to stop getting warning messages from Keras)
        list_of_images = [cv2.imread(config["ImgPath"])]
        array_of_prediction_ready_images = pre_process_image_for_vgg(list_of_images)
        array_of_feature_maps = vggmodel.predict(array_of_prediction_ready_images)
    
        # Find conversions from feature map (CNN output) to input image
        feature_to_input_x_scale, feature_to_input_y_scale, feature_to_input_x_offset, feature_to_input_y_offset = find_feature_map_to_input_scale_and_offset(array_of_prediction_ready_images[0],array_of_feature_maps[0])
    
        # get potential boxes, aka anchor boxes
        potential_boxes = get_potential_boxes_for_region_proposal(array_of_prediction_ready_images[0],array_of_feature_maps[0],feature_to_input_x_scale, feature_to_input_y_scale, feature_to_input_x_offset, feature_to_input_y_offset)
    
        # Create region proposal network
        rpn_model = create_region_proposal_network(len(potential_boxes))
    
        # Output following (height, width, anchor_num)     (height, width, anchor_num * 4)
        predicted_scores_for_anchor_boxes, predicted_adjustments  = rpn_model.predict(array_of_feature_maps)
    
        print(f"predicted_scores_for_anchor_boxes.shape = {predicted_scores_for_anchor_boxes.shape}, predicted_adjustments.shape = {predicted_adjustments.shape}")
        print(f"But why is there the ,5,5, bit? I don't know which ones to choose now to get the predicted bounding box?")
    def pre_process_image_for_vgg(img):
        """
            Resizes the image to the VGG16InputSize specified in the config dictionary
            Normalises the image
            Reshapes the image to an array of images e.g. [[img],[img],..]

            Accepts either a single np.ndarray image or a list of images.
        """
        if type(img) == np.ndarray: # Single image 
            resized_img = cv2.resize(img,config["VGG16InputSize"],interpolation = cv2.INTER_AREA)
            normalised_image = applications.vgg16.preprocess_input(resized_img)
            reshaped_to_array_of_images = np.array([normalised_image])
            return reshaped_to_array_of_images
        elif type(img) == list: # list of images
            img_list = img
            resized_img_list = [cv2.resize(image,config["VGG16InputSize"],interpolation = cv2.INTER_AREA) for image in img_list]
            resized_img_array = np.array(resized_img_list)
            normalised_images_array = applications.vgg16.preprocess_input(resized_img_array)
            return normalised_images_array
    
    def find_feature_map_to_input_scale_and_offset(pre_processed_input_image,feature_maps):
        """
            Finds the scale and offset from the feature map (output) of the CNN classifier to the pre-processed input image of the CNN
        """
        # Find shapes of feature maps and input images to the classifier CNN
        input_image_shape = pre_processed_input_image.shape
        feature_map_shape = feature_maps.shape
        img_height, img_width, _ = input_image_shape
        features_height, features_width, _ = feature_map_shape
    
        # Find mapping from features map (output of vggmodel.predict) back to the input image
        feature_to_input_x = img_width / features_width
        feature_to_input_y = img_height / features_height
    
        # Put anchor points in the centre of each feature map cell when projected onto the input image
        feature_to_input_x_offset = feature_to_input_x/2
        feature_to_input_y_offset = feature_to_input_y/2
    
        return feature_to_input_x, feature_to_input_y, feature_to_input_x_offset, feature_to_input_y_offset
    
    def get_coordinates_of_anchor_points(feature_map,feature_to_input_x,feature_to_input_y,x_offset,y_offset):
        """
            Maps the CNN output (Feature map) coordinates on the input image to the CNN 
            Returns the coordinates as a list of dictionaries with the format {"x":x,"y":y}
        """
        features_height, features_width, _ = feature_map.shape
    
        # For the feature map (x,y) determine the anchors on the input image (x,y) as array 
        feature_to_input_coords_x  = [int(x_feature*feature_to_input_x+x_offset) for x_feature in range(features_width)]
        feature_to_input_coords_y  = [int(y_feature*feature_to_input_y+y_offset) for y_feature in range(features_height)]
        coordinate_of_anchor_points = [{"x":x,"y":y} for x in feature_to_input_coords_x for y in feature_to_input_coords_y]
    
        return coordinate_of_anchor_points
    
    def get_potential_boxes_for_region_proposal(pre_processed_input_image,feature_maps,feature_to_input_x, feature_to_input_y, x_offset, y_offset):
        """
            Generates the anchor points (the centre of the enlarged feature map) as an (x,y) position on the input image
            Generates all the potential bounding boxes for each anchor point
            returns a list of potential bounding boxes in the form {"x1","y1","x2","y2"}
        """
        # Find shapes of input images to the classifier CNN
        input_image_shape = pre_processed_input_image.shape
    
        # For the feature map (x,y) determine the anchors on the input image (x,y) as array 
        coordinate_of_anchor_boxes = get_coordinates_of_anchor_points(feature_maps,feature_to_input_x,feature_to_input_y,x_offset,y_offset)
    
        # Create potential boxes for classification
        boxes_width_height = generate_potential_box_dimensions(config["AnchorBox"],feature_to_input_x,feature_to_input_y)
        list_of_potential_boxes_for_coords = [generate_potential_boxes_for_coord(boxes_width_height,coord) for coord in coordinate_of_anchor_boxes]
        potential_boxes = [box for boxes_for_coord in list_of_potential_boxes_for_coords for box in boxes_for_coord]
        
        return potential_boxes
    
    def generate_potential_box_dimensions(settings,feature_to_input_x,feature_to_input_y):
        """
            Generate potential boxes height & width for each point aka anchor boxes given the 
            ratio between feature map to input scaling for x and y
            Assumption 1: Settings will have the following attributes
                AspectRatioW_div_W: A list of float values representing the aspect ratios of
                    the anchor boxes at each location on the feature map
                Scales: A list of float values representing the scale of the anchor boxes
                    at each location on the feature map.
        """
        box_width_height = []
        for scale in settings["Scales"]:
            for aspect_ratio_w_div_h in settings["AspectRatioW_div_W"]:
                width = round(feature_to_input_x*scale*aspect_ratio_w_div_h)
                height = round(feature_to_input_y*scale/aspect_ratio_w_div_h)
                box_width_height.append({"Width":width,"Height":height})
        return box_width_height
    
    def generate_potential_boxes_for_coord(box_width_height,coord):
        """
            Assumption 1: box_width_height is an array of dictionary with each dictionary consisting of
                {"Width":positive integer, "Height": positive integer}
            Assumption 2: coord is a dictionary of the form
                {"x": centre of box x coordinate, "y": centre of box y coordinate}
        """
        potential_boxes = []
        for box_dim in box_width_height:
            potential_boxes.append({
                "x1": coord["x"]-int(box_dim["Width"]/2)
                ,"y1": coord["y"]-int(box_dim["Height"]/2)
                ,"x2": coord["x"]+int(box_dim["Width"]/2)
                ,"y2": coord["y"]+int(box_dim["Height"]/2)
            })
        return potential_boxes
    
    def create_region_proposal_network(number_of_potential_bounding_boxes,number_of_feature_map_channels=512):
        """
            Creates the region proposal network, which takes the feature map as input.
            Compiles the model and returns it.

            RPN consists of an input layer, a CNN and two output layers.
                output_deltas: predicted adjustments (deltas) to each anchor box
                output_scores: predicted probability that each anchor box contains a foreground object
    
            Note: Number of feature map channels should be the last element of model.predict().shape
        """
        # Input layer
        feature_map_tile = layers.Input(shape=(None,None,number_of_feature_map_channels),name="RPN_Input_Same")
        # CNN component
        convolution_3x3 = layers.Conv2D(filters=512,kernel_size=(3, 3),name="3x3")(feature_map_tile)
        # Output layers
        output_deltas = layers.Conv2D(filters= 4 * number_of_potential_bounding_boxes,kernel_size=(1, 1),activation="linear",kernel_initializer="uniform",name="Output_Deltas")(convolution_3x3)
        output_scores = layers.Conv2D(filters=1 * number_of_potential_bounding_boxes,kernel_size=(1, 1),activation="sigmoid",kernel_initializer="uniform",name="Output_Prob_FG")(convolution_3x3)
    
        model = Model(inputs=[feature_map_tile], outputs=[output_scores, output_deltas])
    
        # TODO add loss_cls and smoothL1
        model.compile(optimizer='adam', loss={'Output_Prob_FG':losses.binary_crossentropy, 'Output_Deltas':losses.huber})
    
        return model
    
    if __name__ == "__main__":
        main()

My research so far

  1. [Step-by-step explanation of an RPN + extras] - https://dongjk.github.io/code/object+detection/keras/2018/05/21/Faster_R-CNN_step_by_step,_Part_I.html
  2. [VGG with top=false will only output the feature map, i.e. (7,7,512); other setups will produce different features] - https://github.com/keras-team/keras/issues/4465
  3. [Understanding anchor boxes] - https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/
  4. [Faster RCNN - how they calculate the stride] - https://stats.stackexchange.com/questions/314823/how-is-the-stride-calculated-in-the-faster-rcnn-paper
  5. [Good article explaining Faster RCNN] - https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8
  6. [Suggests anchor boxes should be determined by aspect ratio and scale; aspect ratios should be width:height of 1:2, 1:1, 2:1 and scales should be 1, 1/2, 1/3] - https://keras.io/examples/vision/retinanet/
  7. [Best explanation of anchor boxes] - https://www.mathworks.com/help/vision/ug/anchor-boxes-for-object-detection.html#:~:text=Anchor%20boxes%20are%20a%20set,sizes%20in%20your%20training%20datasets
  8. [Summary of the history of object detection, an interesting read] - https://dudeperf3ct.github.io/object/detection/2019/01/07/Mystery-of-Object-Detection/
  9. [Mask RCNN Jupyter notebook] - https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_model.ipynb
  10. [RPN in Python Keras that I'm trying to understand] - https://github.com/dongjk/faster_rcnn_keras/blob/master/RPN.py
  11. [RPN implementation in Keras Python] - https://github.com/you359/Keras-FasterRCNN/blob/master/keras_frcnn/data_generators.py
  12. [Well-regarded RPN implementation] - https://github.com/virgil81188/Region-Proposal-Network/tree/03025cde75c1d634b608c277e6aa40ccdb829693
  13. [RPN loss function clearly explained] - https://www.geeksforgeeks.org/faster-r-cnn-ml/
  14. [RPN developed in the Keras Python framework] - https://github.com/alexmagsam/keras-rpn

Wow, you made it all the way to the bottom, I hope you enjoyed the read!

Keras/TensorFlow uses the BHWC convention for tensor shapes (also known as "channels last"). Looking at the output shape of the VGG model, (1, 7, 7, 512), this means the spatial grid is 7x7 and there are 512 channels. The RPN you defined outputs a tensor of shape (1, 5, 5, 784), which, as you guessed, has a lower spatial resolution than the VGG network.
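
If you want to double-check this yourself, here is a minimal sketch (dummy zero input, and weights=None just to skip the ImageNet download; these names are only illustrative):

    import numpy as np
    from keras import applications

    # Dummy 224x224 RGB batch of one image, channels-last (BHWC)
    dummy_batch = np.zeros((1, 224, 224, 3), dtype=np.float32)

    backbone = applications.VGG16(include_top=False, weights=None)
    feature_maps = backbone.predict(dummy_batch)
    print(feature_maps.shape)  # (1, 7, 7, 512) -> 7x7 spatial grid, 512 channels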

From your RPN code the explanation is simple: you used a Conv2D with a 3x3 kernel and the default padding of 'valid'. This means the spatial extent of the output is smaller than that of the input, because the convolution is only applied at "valid" positions, i.e. where the kernel fits entirely inside the input tensor.

Using padding='same' will fix this, and you will get a tensor of shape (1, 7, 7, 784).
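
For example, a minimal sketch of that change against the create_region_proposal_network code above (only the 3x3 convolution needs to change):

    # Same 3x3 convolution, but padding='same' preserves the 7x7 spatial grid
    convolution_3x3 = layers.Conv2D(
        filters=512,
        kernel_size=(3, 3),
        padding="same",  # the default 'valid' shrinks 7x7 down to 5x5
        name="3x3"
    )(feature_map_tile)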

The formula for computing the output tensor shape from the input tensor shape and the convolution parameters is (channels-first, as given in the PyTorch Conv2d documentation):

    H_out = floor((H_in + 2*padding[0] - dilation[0]*(kernel_size[0] - 1) - 1) / stride[0] + 1)
    W_out = floor((W_in + 2*padding[1] - dilation[1]*(kernel_size[1] - 1) - 1) / stride[1] + 1)

where padding, dilation, kernel_size and stride are tuples of (int, int), holding the values for the height and width dimensions respectively.
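
Plugging your numbers into that formula, for instance (a small standalone sketch, no framework required; conv_output_size is just an illustrative helper):

    import math

    def conv_output_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
        """Output spatial size of a convolution along one dimension (height or width)."""
        return math.floor((in_size + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

    print(conv_output_size(7, kernel_size=3, padding=0))  # padding='valid' -> 5
    print(conv_output_size(7, kernel_size=3, padding=1))  # padding='same' for a 3x3 kernel -> 7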

I was confusing anchor boxes and anchor points.

  • Anchor point: relates to a coordinate on the feature map (the output of the backbone classifier)
  • Anchor boxes: the set of scales and aspect ratios applied at each anchor point (see the quick count below)
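
Put concretely, a small sketch using the numbers from the question:

    # Anchor points: one per feature map cell          -> 7 * 7 = 49
    # Anchor boxes: one per (scale, aspect ratio) pair -> 4 * 4 = 16 per anchor point
    # Total potential boxes: 49 * 16 = 784, which is where len(all_anchor_boxes) == 784 comes from
    num_anchor_points = 7 * 7
    boxes_per_point = len(config["AnchorBox"]["Scales"]) * len(config["AnchorBox"]["AspectRatioW_div_W"])
    print(num_anchor_points * boxes_per_point)  # 784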

I made this mistake in the code below: I used number_of_potential_bounding_boxes, which is 784, i.e. feature_width * feature_height * scale_boxes * aspect_ratios. Instead it should be 16, i.e. scale_boxes * aspect_ratios.

    # Output layers
    output_deltas = layers.Conv2D(filters= 4 * number_of_potential_bounding_boxes,kernel_size=(1, 1),activation="linear",kernel_initializer="uniform",name="Output_Deltas")(convolution_3x3)
    output_scores = layers.Conv2D(filters=1 * number_of_potential_bounding_boxes,kernel_size=(1, 1),activation="sigmoid",kernel_initializer="uniform",name="Output_Prob_FG")(convolution_3x3)
    

So the output should be [:,7,7,16]

which breaks down as

[number of images to be predicted, height of feature map, width of feature map, number of anchor boxes per anchor point (scales * aspect ratios)]
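
To come back to the original "which element corresponds to which anchor box?" question, here is a hedged sketch of building the corrected head and walking its output. It assumes the config from the question, padding='same' in the 3x3 convolution, and that the deltas for anchor box k sit in channels 4*k to 4*k+3 (that ordering is a choice you make when building the training targets, not something the network decides for you):

    # 4 scales * 4 aspect ratios = 16 anchor boxes per anchor point
    boxes_per_point = len(config["AnchorBox"]["Scales"]) * len(config["AnchorBox"]["AspectRatioW_div_W"])
    rpn_model = create_region_proposal_network(boxes_per_point)

    scores, deltas = rpn_model.predict(array_of_feature_maps)  # e.g. (1, 7, 7, 16) and (1, 7, 7, 64)

    _, rows, cols, boxes_per_point = scores.shape
    for row in range(rows):                   # y position on the feature map
        for col in range(cols):               # x position on the feature map
            for k in range(boxes_per_point):  # which (scale, aspect ratio) combination
                objectness = scores[0, row, col, k]
                dx, dy, dw, dh = deltas[0, row, col, 4 * k: 4 * k + 4]
                # (row, col) maps back to an anchor point on the input image via
                # feature_to_input_y / feature_to_input_x, and k picks the anchor box
                # produced by generate_potential_box_dimensions at that anchor point.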

Below are two simple RPNs (without regression) that I implemented:

  1. https://github.com/alexshellabear/Simple-Region-Proposal-Network
  2. https://github.com/alexshellabear/Still-Simple-Region-Proposal-Network