Building a 3D CNN for binary classification of greyscale MRI data: data dimensionality issue when attempting model.fit

I'm trying to build a 3D CNN for binary classification of greyscale MRI data. I'm new to this, so go easy on me, I'm here to learn! I have a subsample of 20 3D files, each with dimensions (189, 233, 197). I add a dimension to act as the channel using np.reshape, giving (189, 233, 197, 1). Using tf.shape on the dataset gives

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([ 20, 189, 233, 197,   1], dtype=int32)>

and likewise for the label data:

<tf.Tensor: shape=(1,), dtype=int32, numpy=array([20], dtype=int32)>
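
For illustration, here is a minimal sketch of how a shape like that can be produced and checked (np.stack plus np.expand_dims is one way to arrive at the same layout; it isn't my exact loading code, which is below):

import numpy as np
import tensorflow as tf

# 20 placeholder volumes with the same spatial dimensions as the MRI data
volumes = [np.zeros((189, 233, 197), dtype=np.float32) for _ in range(20)]

data = np.stack(volumes)              # (20, 189, 233, 197)
data = np.expand_dims(data, axis=-1)  # (20, 189, 233, 197, 1)

print(tf.shape(data))  # tf.Tensor([ 20 189 233 197   1], shape=(5,), dtype=int32)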

Below is the full code I'm using:

import numpy as np
import glob
import os
import tensorflow as tf
import pandas as pd

import SimpleITK as sitk

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

from google.colab import drive
drive.mount('/content/gdrive')

datapath = '/content/gdrive/My Drive/DirectoryTest/All Data/'
patients = os.listdir(datapath)
labels_df = pd.read_csv('/content/Data_Index.csv', index_col=0)

FullDataSet = []

for patient in patients:
  a = sitk.ReadImage(datapath + patient)  # read the image volume from disk
  b = sitk.GetArrayFromImage(a)           # voxel data as a numpy array
  c = np.reshape(b, (189, 233, 197))
  FullDataSet.append(c)

labelset = []

for i in patients:
  label = labels_df.loc[i, 'Group']
  if label == 'AD':  # use `==` instead of `is` to compare strings
    labelset.append(0.)
  elif label == 'CN':
    labelset.append(1.)
  else:
    raise ValueError("Oops, unknown label")  # raising a bare string is a TypeError in Python 3

labelset = np.array(labelset)

x_train, x_valid, y_train, y_valid = train_test_split(FullDataSet, labelset, train_size=0.75)
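# x_train now holds 15 of the 20 volumes and x_valid the remaining 5 (train_size=0.75)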

## 3D CNN

CNN_model = tf.keras.Sequential(
  [
      #tf.keras.layers.Reshape([189, 233, 197, 1], input_shape=[189, 233, 197]), 
      tf.keras.layers.Input(shape=[189, 233, 197, 1]),
      tf.keras.layers.Conv3D(kernel_size=(7, 7, 7), filters=32, activation='relu',
                          padding='same', strides=(3, 3, 3)),
      #tf.keras.layers.BatchNormalization(),
      tf.keras.layers.MaxPool3D(pool_size=(3, 3, 3), padding='same'),
      tf.keras.layers.Dropout(0.20),
      
      tf.keras.layers.Conv3D(kernel_size=(5, 5, 5), filters=64, activation='relu',
                          padding='same', strides=(3, 3, 3)),
      #tf.keras.layers.BatchNormalization(),
      tf.keras.layers.MaxPool3D(pool_size=(2, 2, 2), padding='same'),
      tf.keras.layers.Dropout(0.20),

      tf.keras.layers.Conv3D(kernel_size=(3, 3, 3), filters=128, activation='relu',
                          padding='same', strides=(1, 1, 1)),
      #tf.keras.layers.BatchNormalization(),
      tf.keras.layers.MaxPool3D(pool_size=(2, 2, 2), padding='same'),
      tf.keras.layers.Dropout(0.20), 

      # last activation could be either sigmoid or softmax, need to look into this more. Sig for binary output, Soft for multi output 
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(256, activation='relu'),   
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dropout(0.20),
      tf.keras.layers.Dense(1, activation='sigmoid')

  ])
# Compile the model
CNN_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001), loss='binary_crossentropy', metrics=['accuracy'])

# print model layers
CNN_model.summary()

CNN_history = CNN_model.fit(x_train, y_train, epochs=10, validation_data=[x_valid, y_valid], batch_size=1)

When I try to fit the model, the dimensions don't seem to line up, and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-c698c45a4d36> in <module>()
      1 #running of the model
      2 #CNN_history = CNN_model.fit(dataset_train, epochs=100, validation_data =dataset_test, validation_steps=1)
----> 3 CNN_history = CNN_model.fit(x_train, y_train, epochs=10, validation_data=[x_valid, y_valid], batch_size = 1)
      4 
      5 

3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1061           use_multiprocessing=use_multiprocessing,
   1062           model=self,
-> 1063           steps_per_execution=self._steps_per_execution)
   1064 
   1065       # Container that configures and calls `tf.keras.Callback`s.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
   1115         use_multiprocessing=use_multiprocessing,
   1116         distribution_strategy=ds_context.get_strategy(),
-> 1117         model=model)
   1118 
   1119     strategy = ds_context.get_strategy()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, sample_weight_modes, batch_size, epochs, steps, shuffle, **kwargs)
    280             label, ", ".join(str(i.shape[0]) for i in nest.flatten(data)))
    281       msg += "Please provide data which shares the same first dimension."
--> 282       raise ValueError(msg)
    283     num_samples = num_samples.pop()
    284 

ValueError: Data cardinality is ambiguous:
  x sizes: 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189
  y sizes: 15
Please provide data which shares the same first dimension.

The training split is set to 0.75, hence 15 out of 20. I'm confused as to why this isn't working, and I can't figure out why the model is receiving this as input. I've had some help before, using the following code to create a dummy set, and with it the model will run:

train_size = 20
val_size = 5

X_train = np.random.random([train_size, 189, 233, 197]).astype(np.float32)
X_valid = np.random.random([val_size, 189, 233, 197]).astype(np.float32)
y_train = np.random.randint(2, size=train_size).astype(np.float32)
y_valid = np.random.randint(2, size=val_size).astype(np.float32)

I've been banging my head against the wall over this for a while now. Any help would be greatly appreciated.

I don't currently have commenting privileges, otherwise I would have said:

When I try to create a toy 4-dimensional dataset and then append it to a list (adding a channel, which I believe is what you've done?), the shape I get is not (dim1, dim2, dim3, dim4, channel) but (channel, dim1, dim2, dim3, dim4). I've included a working example below:

import numpy as np

arr = np.arange(0, 625).reshape(5, 5, 5, 5)
print(arr.shape)  # returns (5, 5, 5, 5)

lst = []
lst.append(arr)

print(np.asarray(lst).shape)  # returns (1, 5, 5, 5, 5)

Based on this, is it possible that your data's shape is actually (1, 189, 233, 197) rather than the (189, 233, 197, 1) you expected?
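
If that is the case, here is one way to get the channel where Keras expects it (a minimal sketch, assuming each volume is a (189, 233, 197) array): np.expand_dims adds a trailing channel axis, and np.moveaxis fixes a leading one:

import numpy as np

volume = np.zeros((189, 233, 197), dtype=np.float32)

# append a channel axis: (189, 233, 197) -> (189, 233, 197, 1)
print(np.expand_dims(volume, axis=-1).shape)

# if the extra axis ended up in front, move it to the back:
# (1, 189, 233, 197) -> (189, 233, 197, 1)
batched = np.zeros((1, 189, 233, 197), dtype=np.float32)
print(np.moveaxis(batched, 0, -1).shape)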

Also, the error message seems to suggest that you aren't passing the same number of samples for X and y?

ValueError: Data cardinality is ambiguous:
  x sizes: 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189
  y sizes: 15
Please provide data which shares the same first dimension.

Generally, the inputs to a network share the same first dimension (and, to steal your own toy dataset as an example, running):

print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
# returns: (20, 189, 233, 197) (20,) (5, 189, 233, 197) (5,)

They match because this essentially means that each sample corresponds to exactly one label and vice versa. To me, the error message indicates that the first dimension of your X is 189 and that of your y is 15. Could you double-check the shapes right before they go into the network?
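
If the shapes are off, my guess (an assumption on my part, since I can't run your data) is that FullDataSet is still a plain Python list when it reaches train_test_split, so Keras sees a list of 15 separate (189, 233, 197) inputs rather than one array of 15 samples. Stacking the list into a single channel-last array before fitting would look roughly like this:

import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data: 20 volumes of shape (189, 233, 197) plus binary labels
FullDataSet = [np.zeros((189, 233, 197), dtype=np.float32) for _ in range(20)]
labelset = np.random.randint(2, size=20).astype(np.float32)

x = np.stack(FullDataSet)[..., np.newaxis]  # (20, 189, 233, 197, 1)

x_train, x_valid, y_train, y_valid = train_test_split(x, labelset, train_size=0.75)
print(x_train.shape, y_train.shape)  # (15, 189, 233, 197, 1) (15,)

# then: CNN_model.fit(x_train, y_train, epochs=10,
#                     validation_data=(x_valid, y_valid), batch_size=1)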