How to select a random number of random MNIST digits whose labels do not repeat, while excluding a certain digit?

Question

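I'm new to coding and appreciate any help, but please be gentle with me.

I am using the MNIST database with a neural network because I want to transfer the results to another problem. What I am trying to do is manipulate the MNIST training dataset by attaching a set of images to each image to be classified. Allow me to outline the approach:

When training a neural network, the MNIST database provides handwritten digits (x_train) and their labels/classes (y_train).
However, I want to train the network not just on a single input image, but also give it a set of optional images to choose from.
So if I want the machine to classify the digit "5", it gets the image of the digit "5" plus a set of random images, of random size, as input: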

-> Input = image to classify "5" | images to refer to "1", "4", "5"; the next is image to classify "0" | images to refer to "0", "9", "3", "5", "6", and so on.

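The "images to refer to" should always contain the "digit to classify" but never the "image to classify" itself. That is, the index of the image to classify ("5") must differ from the index of the "5" among the images to refer to.
So far I have managed to select random images (digit_randomizer()) in a random quantity (random_with_N_digits()). What I am missing is:
1. Excluding the image's own index: the index of the "5" to classify must not be the index of the "5" offered as an option.
2. The images to refer to must not contain duplicate digits.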

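To 1.: Below you can see my function digit_randomizer(). I currently have no idea how to solve this other than using a nested loop that checks "np.where(j != i)".

To 2.: I am thinking about splitting y_train into 10 groups, one per label (digit). But I don't know what kind of command to write, because I need to define a random number of images, each picked at random from one of these 10 groups, while keeping track of the index.

This is my code so far: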

import keras as k
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = k.datasets.mnist.load_data()


print('')
print('x_train shape:', x_train.shape)

# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
# Making sure that the values are float so that we can get decimal points after division
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the grayscale pixel values to [0, 1] by dividing by the maximum value (255).
x_train /= 255
x_test /= 255
print('x_train shape reshaped:', x_train.shape)
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])

classes = set(y_train)
print('Number of classes =', len(classes),'\n')
print('Classes: \n', classes, '\n')

print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)


import random
import warnings
warnings.filterwarnings("ignore")
from random import randint

#select a random image from the training data
def digit_select():
    for j in range(1):
        j = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        digit = (x_train[j] * 255).reshape((28, 28)).astype("uint8")
        imgplot = plt.imshow(digit)
        plt.title(y_train[j])
        imgplot.set_cmap('Greys')
        plt.show()

# return a random count between 0 and 9 (randint includes both ends)
def random_with_N_digits():
    range_start = 0
    range_end = 9
    return randint(range_start, range_end)

# show a random number (0 to 9) of random images
def digit_randomizer():  
    for i in range(random_with_N_digits()):
        i = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        image = (x_train[i] * 255).reshape((28, 28)).astype("uint8")
        imgplot = plt.imshow(image)
        imgplot.set_cmap('Greys')
        plt.title(y_train[i])
        plt.show()

Somehow the image picked by digit_select should be excluded from digit_randomizer, and digit_randomizer should select only one image per class from y_train.

Any ideas are greatly appreciated.

Code edit:

def digit_label_randselect():
    j = np.random.choice(np.arange(0, len(y_train)), size=(1,))
    return int(y_train[j])
print('Randomly selected label:', digit_label_randselect())

Output: Randomly selected label: 4

def n_reference_digits(input_digit_label):
    other_digits = list(np.unique(y_train)) #create list with all digits
    other_digits.remove(input_digit_label) #remove the input digit label
    sample = random.sample(other_digits, len(np.unique(y_train))-1) #Take a sample of size n of the digits
    sample.append(input_digit_label)
    random.shuffle(sample)
    return sample
print('Randomly shuffled labels:', n_reference_digits(digit_label_randselect()))

Output: Randomly shuffled labels: [8, 0, 6, 2, 7, 4, 3, 5, 9, 1]


'''returns a list of 10 random indices.
necessary to choose random 10 digits as a set, which will be used to train the NN.
the set needs to contain a single identical y_train value (label),
meaning that a random digit definitely has the same random digit in the set.
however their indices have to differ. moreover all y_train values (labels) have to be different,
meaning that the set contains a single representation of every digit.'''
def digit_indices_randselect():
    listi = []
    for i in range(10):
        i = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        listi.append(i)
    return listi
listindex = digit_indices_randselect()
print('Random list of indices:', listindex)

Output: Random list of indices: [array([53451]), array([31815]), array([4519]), array([21354]), array([14855]), array([45147]), array([42903]), array([37681]), array([1386]), array([9584])]

'''for every index in listindex return the corresponding index, pixel array and label'''
#TO DO: One Hot Encode the labels
def array_and_label_for_digit_indices_randselect():
    listi = []
    digit_data = []
    labels = []
    for i in listindex:
        digit_array = x_train[i] #digit data (image array) is the data from index i
        label = y_train[i] #corresponding label
        listi.append(i)
        digit_data.append(digit_array)
        labels.append(label)
    list3 = list(zip(listi, digit_data, labels))
    return list3
array_and_label_for_digit_indices_randselect()


Output:[(array([5437]),
  array([[[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   4,  29,  29,  29,
            29,  29,  29,  29,  92,  91, 141, 241, 255, 228,  94,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,  45, 107, 179, 252, 252, 252,
           253, 252, 252, 252, 253, 252, 252, 252, 253, 252, 224,  19,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,  45, 240, 252, 253, 252, 252, 252,
           253, 252, 252, 252, 253, 252, 252, 252, 253, 252, 186,   6,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0, 157, 252, 252, 253, 252, 252, 252,
           253, 252, 252, 252, 241, 215, 252, 252, 253, 202,  19,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,  41, 253, 253, 253, 255, 234, 100,   0,
             0,   0,   0,   0,   0,  70, 253, 253, 251, 125,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,  66, 252, 252, 252, 253, 133,   0,   0,
             0,   0,   0,   0,   0, 169, 252, 252, 200,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   7, 130, 168, 168, 106,  19,   0,   0,
             0,   0,   0,   0,  10, 197, 252, 252, 113,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0, 128, 252, 252, 252,  63,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,  13, 204, 253, 253, 241,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,  88, 253, 252, 233, 109,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0, 225, 253, 252, 234,  22,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,  38, 237, 253, 252, 164,  15,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,  26, 172, 253, 254, 228,  31,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0, 114, 234, 252, 253, 139,  19,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           111, 234, 252, 252,  94,  19,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           241, 252, 252, 202,  13,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  76,
           254, 253, 253,  78,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 225,
           253, 252, 233,  22,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 225,
           253, 233,  62,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  38, 187,
           241,  59,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0]]], dtype=uint8),
  array([7], dtype=uint8)),...


'''for every index in x_train return a random index, its pixel array, label.
also return a list of 10 random indices (digit_indices_randselect())'''
def digit_with_set():
    for i in x_train:
        i = random.randrange(len(y_train)) #returns a random index of a digit between 0 - 60000
        digit_data = x_train[i] #digit data (image array) is the data from index i
        label = y_train[i] #corresponding label
        #digit_randomizer() #returns a random set of 10 images
        print("Index of digit to classify:", i), \
        print("Digit to classify:", label), \
        print("Corresponding array:", digit_data), \
        print("Set of 10 Images:", array_and_label_for_digit_indices_randselect())
        print("")
        print("Next:")
digit_with_set()



***PURPOSE Edit:*** The purpose of this approach is to investigate whether a neural network can learn a model that not only classifies the input, but also recognizes that the label can be chosen from the optional set. That is, the model not only classifies the "5" as a "5", but also looks at its options and finds a match there as well.

This may not make much sense for an image classification problem. However, I am working on a sequence-to-sequence problem in another project. The input sequences are spread over multiple columns in a .csv file. The output is another sequence. The difficulty lies in the very heterogeneous input, so the accuracy is low and the loss is very high.

This is how the data is structured:

**Input**: | AA_10| 31.05.2019 | CW20 | Project1 |   **Output**: AA_Project1_[11]

**Input**: |      | CW19       |      | Project2 |   **Output**: AA_Project2_[3]

**Input**: | y550 | 01.06.2019 | AA12 | Project1 |   **Output**: AA_Project1_[12]

The AA_ProjectX_[Value] within the output is the main issue since its range varies from project to project. Project1 can have [0-12], Project 2 can have [0-20], ProjectX [0-N].

By adding a range of values to the input data I hope to restrict the network from learning values which are not part of the project.

Input: | AA_10| 31.05.2019 | CW20 | Project1 | [0,1,2,3,4,5,6,7,8,9,10,11,12] |  Output: AA_Project1_[11]

So when I want to classify the digit 5, I give the machine a range of possible classes and corresponding images to derive the output class from.

Answer 1

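Although I don't quite understand why you want to do this, you can randomly select some indices in your data/labels, taking care not to pick the unwanted digit.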

import random

data = [["digit1"],["digit3"],["digit1"],["digit2"],["digit3"],["digit1"],["digit2"],        
     ["digit1"],["digit2"],["digit3"]]

labels = [1,3,1,2,3,1,2,1,2,3]

unwanted_label = 1
nb_samples = 3

samples = random.sample([(i, j) for i, j in zip(data, labels) if j!=unwanted_label],nb_samples)

print(list(zip(*samples)))

You will get your data and the associated labels at random, like this:

[(['digit2'], ['digit3'], ['digit3']), (2, 3, 3)]
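
The same filtering idea can also be applied to indices instead of (data, label) pairs, which is convenient when the data lives in NumPy arrays such as x_train and y_train. The following is only a minimal sketch: the `labels` array is a randomly generated stand-in for y_train, and the names `candidates` and `picked` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for y_train: 1000 labels in 0..9
labels = rng.integers(0, 10, size=1000)

unwanted_label = 1
nb_samples = 3

# Indices of every sample whose label is not the unwanted one
candidates = np.flatnonzero(labels != unwanted_label)

# replace=False guarantees that no index (image) is drawn twice
picked = rng.choice(candidates, size=nb_samples, replace=False)

print(picked, labels[picked])
```

As in the list-based version, the labels of the picked samples may still repeat; only the unwanted label is ruled out.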

Answer 2

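You have asked multiple questions. This is discouraged on SO, because the question as a whole becomes very specific to you and therefore won't help others. Try to break your problem down into more basic questions, which you can then search for (since they may already have been answered) or ask separately. You will find that people are more likely to respond if your questions are more modular.

Before I address your question specifically, I have a few comments on your code, since you mention that you are new to coding. I am being as explicit as possible; the code could be shorter or more elegant, but I chose expressiveness as my main goal.

A function is a modular piece of code that ideally has one job. Therefore, I would not put the visualization code inside your digit_select. I would create a separate function for visualization, perhaps something like this: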

def vis_digit(index):
    digit = (x_train[index] * 255).reshape((28, 28)).astype("uint8")
    imgplot = plt.imshow(digit)
    plt.title(y_train[index])
    imgplot.set_cmap('Greys')
    plt.show()

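Now we can refactor digit_select further. I don't think you need the for loop here. As I understand it, the method selects just one random image, so there is nothing to repeat. The way you have written it, nothing is repeated anyway, because range(1) yields an iterable containing only 0. Furthermore, j is the index you use to select the training image, and it can be a plain integer. You can therefore use random.randrange or random.randint; I prefer randrange (please read the documentation for the difference between the two). You want to remember which image you used, because that image must not appear in your reference set, so I suggest returning j. The digit_select method could look like this: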

def digit_select():
    digit_index = random.randrange(len(y_train))
    digit_data = x_train[digit_index]
    label = y_train[digit_index]
    return digit_index, digit_data, label
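
As a small aside on the randrange versus randint remark above, the following sketch shows the difference in their upper bounds. The variable names `n`, `r1`, and `r2` are made up for illustration.

```python
import random

n = 5  # think of this as len(y_train)

# randrange(n) draws from 0..n-1, so every result is a valid index
# into a sequence of length n
r1 = [random.randrange(n) for _ in range(1000)]

# randint(a, b) includes both ends, so randint(0, n) can also return n,
# which would be out of range as an index
r2 = [random.randint(0, n) for _ in range(1000)]

print(min(r1), max(r1))
print(min(r2), max(r2))
```

This is why randrange(len(y_train)) is a natural fit for picking an index, while randint would need len(y_train) - 1 as its upper bound.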

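Now I will answer one aspect of your compound question, which as far as I can tell is: "How do I select a random list of unique numbers that must include a specific number?". This can be googled, for example this. I used the accepted answer in my answer.

I would use a function that returns a list of digit labels that includes the desired input digit label.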

def n_reference_digits(input_digit_label):
    other_digits = list(range(10)) #create list with all digits
    other_digits.remove(input_digit_label) #remove the input digit label
    n = random.randrange(10) #pick a random n [0,10)
    sample = random.sample(other_digits, n) #Take a sample of size n of the digits
    sample.append(input_digit_label)
    return sample

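Now, I understand that this is not finished yet, but try to figure out what the next step is. Try googling that small step, and if you cannot find an answer, just ask a new (slightly more specific) question. :)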
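
One possible next step, offered only as a rough sketch and not as the definitive solution, is to turn the list of reference labels into image indices: for each label, pick one random index carrying that label and never the index of the image to classify. The `labels` array below is a randomly generated stand-in for y_train, and `pick_reference_indices` is a hypothetical helper name.

```python
import random
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)  # hypothetical stand-in for y_train

def pick_reference_indices(labels, query_index, reference_labels):
    """Pick one random index per reference label, never the query index."""
    chosen = []
    for lab in reference_labels:
        candidates = np.flatnonzero(labels == lab)
        # Drop the image to classify so it cannot appear among the references
        candidates = candidates[candidates != query_index]
        # Assumes at least one other image with this label exists
        chosen.append(int(rng.choice(candidates)))
    return chosen

# The image to classify
query_index = random.randrange(len(labels))
query_label = int(labels[query_index])

# Same idea as n_reference_digits: a random subset of the other digits
# plus the digit to classify, so no label repeats
others = [d for d in range(10) if d != query_label]
reference_labels = random.sample(others, random.randrange(10)) + [query_label]
random.shuffle(reference_labels)

reference_indices = pick_reference_indices(labels, query_index, reference_labels)
print(query_index, query_label, reference_labels, reference_indices)
```

Because each label appears at most once in reference_labels and exactly one index is drawn per label, the reference set contains no duplicate digits, and the explicit filter keeps the query image out of it.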

Answer 3

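You can combine the data (join train and test) and convert it to a pandas DataFrame; after that, these two lines are all you need:

bad_labels = df[df['labels'] == X].sample(amount).index
df = df[~df.index.isin(bad_labels)]

where df['labels'] is the last column of the DataFrame (your labels), X is the label you wish to exclude, and amount is the number of randomly sampled rows with that label; those sampled rows are the ones removed.

This is how to join the data:

import numpy as np
data = np.concatenate((x_train, x_test), axis=0)

Convert to pandas:

import pandas as pd
df = pd.DataFrame(data.reshape(len(data), -1))  # flatten each image into one row; DataFrame needs 2-D input

How to add the y labels to the DataFrame (y must hold the labels in the same order as data):

y = np.concatenate((y_train, y_test), axis=0)
df['labels'] = y

To get the train and test parts back, you can use sklearn's train_test_split() function.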
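
Building on the DataFrame idea, the "one image per label, excluding a given digit" requirement can be sketched with a group-wise sample. This is an illustration under assumptions, not part of the approach described above: the `images` and `y` arrays are randomly generated stand-ins for the MNIST data, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the MNIST arrays: 200 "images" of 28x28 and labels
images = rng.random((200, 28, 28))
y = rng.integers(0, 10, size=200)

# One flattened image per row, labels in the last column
df = pd.DataFrame(images.reshape(len(images), -1))
df['labels'] = y

X = 5  # label to exclude

# Drop every row with label X, then draw one random row per remaining label
reference = df[df['labels'] != X].groupby('labels').sample(n=1, random_state=0)

print(reference['labels'].tolist())
print(reference.index.tolist())
```

The index of `reference` holds the row positions in the original data, so it can be used to look the images up again or to check that a particular image was not selected.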

python

random

image-processing

digits

mnist