Create a dataset from too many images in TensorFlow

I can't figure out a way to create a dataset from my images in TensorFlow. My dataset has the following structure:

    fruit-data
    |
    |-train
    |   |
    |   |- Freshapple -> .png images of fresh apples
    |   |- Freshorange -> .png images of fresh oranges
    |   |- Freshbanana -> .png images of fresh bananas
    |   
    |-test
    |   |
    |   |- Rottenapple -> .png images of rotten apples
    |   |- Rottenorange -> .png images of rotten oranges
    |   |- Rottenbanana -> .png images of rotten bananas
    |

My paths and classes are set up like this:

    train_path = ".../Desktop/Data/fruit-dataset/train"
    test_path = ".../Desktop/Data/fruit-dataset/train"
    categories = ["freshapple", "freshorange", "freshbanana",
                  "rottenapple", "rottenorange", "rottenbanana"]

From other resources I've seen, because my dataset contains over 13k images, I need to use flow_from_directory(), since loading everything into memory would crash the runtime.

I'm confused about what the next steps are to load this dataset.

For additional information: I'll be using a tuned MobileNetV2 model (experimenting with frozen layers).

There are many ways to load the data. I prefer to use pandas dataframes because they make it easy to partition the data in various ways. The code below should be what you need:

    import os
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    sdir = r'.../Desktop/Data/fruit-dataset'
    categories = ['train', 'test']
    # walk the train and test directories and build a dataframe of
    # filepaths and class labels for each
    for category in categories:
        catpath = os.path.join(sdir, category)
        classlist = os.listdir(catpath)
        filepaths = []
        labels = []
        for klass in classlist:
            classpath = os.path.join(catpath, klass)
            flist = os.listdir(classpath)
            for f in flist:
                fpath = os.path.join(classpath, f)
                filepaths.append(fpath)
                labels.append(klass)
        Fseries = pd.Series(filepaths, name='filepaths')
        Lseries = pd.Series(labels, name='labels')
        if category == 'train':
            df = pd.concat([Fseries, Lseries], axis=1)
        else:
            test_df = pd.concat([Fseries, Lseries], axis=1)
    # create a validation set by splitting off 20% of the training set
    train_df, valid_df = train_test_split(df, train_size=.8, shuffle=True, random_state=123)
    print('train_df length: ', len(train_df), '  test_df length: ', len(test_df), '  valid_df length: ', len(valid_df))
    # check the balance of the training set
    print(train_df['labels'].value_counts())
    height = 224
    width = 224
    channels = 3
    batch_size = 40
    img_shape = (height, width, channels)
    img_size = (height, width)
    # pick the largest batch size <= 80 that divides the test set evenly,
    # so the test set is covered exactly once in test_steps steps
    length = len(test_df)
    test_batch_size = sorted([int(length/n) for n in range(1, length+1) if length % n == 0 and length/n <= 80], reverse=True)[0]
    test_steps = int(length/test_batch_size)
    print('test batch size: ', test_batch_size, '  test steps: ', test_steps)
    def scalar(img):
        # scale pixel values to the range [0, 1]
        return img/255
    trgen = ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
    tvgen = ImageDataGenerator(preprocessing_function=scalar)
    train_gen = trgen.flow_from_dataframe(train_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                          color_mode='rgb', shuffle=True, batch_size=batch_size)
    test_gen = tvgen.flow_from_dataframe(test_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                         color_mode='rgb', shuffle=False, batch_size=test_batch_size)
    valid_gen = tvgen.flow_from_dataframe(valid_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                          color_mode='rgb', shuffle=True, batch_size=batch_size)
    classes = list(train_gen.class_indices.keys())
    class_count = len(classes)
    # 'model' is your compiled Keras model; a sketch of the tuned
    # MobileNetV2 you mentioned is shown after this block
    history = model.fit(x=train_gen, epochs=20, verbose=2, validation_data=valid_gen,
                        validation_steps=None, shuffle=False, initial_epoch=0)
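
The fit call above assumes `model` already exists. Since you said you'll use a tuned MobileNetV2 with frozen layers, here is a minimal sketch of one way to build it; it reuses `img_shape` and `class_count` from the code above, and the pooling head, softmax output, and Adam learning rate are my assumptions, not something your post specifies:

    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import MobileNetV2
    from tensorflow.keras.optimizers import Adam

    # minimal sketch, assuming 224x224 rgb inputs (img_shape) and
    # class_count output classes from the dataframe code above
    base = MobileNetV2(input_shape=img_shape, include_top=False, weights='imagenet')
    base.trainable = False  # freeze the pretrained layers to experiment with them frozen
    x = layers.GlobalAveragePooling2D()(base.output)
    output = layers.Dense(class_count, activation='softmax')(x)  # assumed classification head
    model = Model(inputs=base.input, outputs=output)
    model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy',
                  metrics=['accuracy'])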

Or a simpler but less versatile approach is flow_from_directory:

    import tensorflow as tf

    gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1/255,
                                                          validation_split=0.1)
    tgen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1/255)
    train_dir = r'.../Desktop/Data/fruit-dataset/train'
    train_gen = gen.flow_from_directory(train_dir, target_size=(256, 256),
                        class_mode="categorical", batch_size=32, shuffle=True,
                        seed=123, subset='training')
    valid_gen = gen.flow_from_directory(train_dir, target_size=(256, 256),
                        class_mode="categorical", batch_size=32, shuffle=True,
                        seed=123, subset='validation')
    test_dir = r'.../Desktop/Data/fruit-dataset/test'  # you had this pointing at train in your code
    test_gen = tgen.flow_from_directory(test_dir, target_size=(256, 256),
                        class_mode="categorical", batch_size=32, shuffle=False)
    history = model.fit(x=train_gen, epochs=20, verbose=2, validation_data=valid_gen,
                        validation_steps=None, shuffle=False, initial_epoch=0)
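
Either way, once training finishes you can evaluate and predict with the test generator; because test_gen was built with shuffle=False, predictions stay aligned with test_gen.filenames. A minimal sketch:

    import numpy as np

    # overall test-set loss and accuracy
    loss, acc = model.evaluate(test_gen, verbose=1)
    print('test loss: ', loss, '  test accuracy: ', acc)
    # per-image predictions; valid because test_gen has shuffle=False
    preds = model.predict(test_gen)
    pred_indices = np.argmax(preds, axis=1)
    class_names = list(test_gen.class_indices.keys())
    for fname, index in list(zip(test_gen.filenames, pred_indices))[:5]:
        print(fname, '->', class_names[index])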