How to split an image dataset into train and test sets with an R script

Question

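I have a folder containing 8000 images. I would like to know how to split these images into a training set and a test set in R, for example 70% of the 8000 for training and 30% for testing. I tried: train <- sample(nrow(trainData), 0.7*nrow(trainData), replace = FALSE) but it does not seem to work for images. I know that in Python we can do: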

`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.15)`

But I don't know how to do this in R.

I extract the images with an extract_feature function:

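library(stringr)  # str_extract
library(pbapply)  # pblapply (lapply with a progress bar)
library(EBImage)  # readImage, resize, channel

extract_feature <- function(dir_path, width, height, labelsExist = TRUE) {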
  img_size <- width * height
  
  ## List images in path
  images_names <- list.files(dir_path)
  
  if(labelsExist){
    ## Select only cats or dogs images
    catdog <- str_extract(images_names, "^(cat|dog)")
    # Set cat == 0 and dog == 1
    key <- c("cat" = 0, "dog" = 1)
    y <- key[catdog]
  }
  
  print(paste("Start processing", length(images_names), "images"))
  ## This function will resize an image, turn it into greyscale
  feature_list <- pblapply(images_names, function(imgname) {
    ## Read image
    img <- readImage(file.path(dir_path, imgname))
    ## Resize image
    img_resized <- resize(img, w = width, h = height)
    ## Set to grayscale (normalized to max)
    grayimg <- channel(img_resized, "gray")
    ## Get the image as a matrix
    img_matrix <- grayimg@.Data
    ## Coerce to a vector (row-wise)
    img_vector <- as.vector(t(img_matrix))
    return(img_vector)
  })
  ## bind the list of vector into matrix
  feature_matrix <- do.call(rbind, feature_list)
  feature_matrix <- as.data.frame(feature_matrix)
  ## Set names
  names(feature_matrix) <- paste0("pixel", c(1:img_size))
  
  if(labelsExist){
    return(list(X = feature_matrix, y = y))
  }else{
    return(feature_matrix)
  }
}

# Takes approx. 15min
trainData <- extract_feature("/run/media/dogvscat/dogs-vs-cats/", width, height)

I want to split this trainData into a training set (70%) and a validation set (30%).

Answer 1

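The pixel features and the labels are stored together in a list, so you have to subset them separately. First, use sample to draw a random vector of row indices between 1 and the number of images (8000). Then use that same vector to select both the pixel rows and the labels, so that each image stays paired with its label. The [] operator should not be applied to the trainData list itself, but to the feature_matrix data frame (trainData$X) and the y vector (trainData$y) inside it.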

# simulated example
# 8000 images in rows with 100 pixels per image
# we have to select images row wise
feature_matrix <- matrix(rnorm(800000), ncol = 100)
feature_matrix <- as.data.frame(feature_matrix)

# simulated labels
y <- sample(0:1, size = 8000, replace = TRUE)
trainData <- list(X = feature_matrix, y = y)

# random vector for image selection
# optional - you can add set.seed for reproducibility 
set.seed(77)
rnd <- sample(1:8000, size = 0.7*8000)

# select images and labels separately from the list
train <- list(X = trainData$X[rnd,], y = trainData$y[rnd])
test <- list(X = trainData$X[-rnd,], y = trainData$y[-rnd])
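
The attempt in the question fails because trainData is a list, so nrow(trainData) returns NULL; the row count has to come from trainData$X. As a minimal sketch, the steps above could be wrapped in a small helper. The function name split_train_test and its arguments train_frac and seed are hypothetical, not part of any package, and the sketch assumes the list has exactly the elements X (one image per row) and y (one label per image) that extract_feature returns.

```r
# Hypothetical helper (not from any package): splits a list with elements
# X (data frame, one image per row) and y (label vector) into two parts.
split_train_test <- function(data, train_frac = 0.7, seed = NULL) {
  n <- nrow(data$X)               # number of images; nrow(data) would be NULL
  stopifnot(length(data$y) == n)  # features and labels must line up
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  idx <- sample(n, size = round(train_frac * n))  # random training rows
  list(
    train = list(X = data$X[idx, , drop = FALSE],  y = data$y[idx]),
    test  = list(X = data$X[-idx, , drop = FALSE], y = data$y[-idx])
  )
}

# Small simulated data set: 10 images with 4 pixels each
toy <- list(
  X = as.data.frame(matrix(rnorm(40), ncol = 4)),
  y = sample(0:1, size = 10, replace = TRUE)
)

parts <- split_train_test(toy, train_frac = 0.7, seed = 77)
nrow(parts$train$X)  # 7
nrow(parts$test$X)   # 3
```

With the data from the question, split_train_test(trainData, 0.7) would then give the 70% part in parts$train and the remaining 30% in parts$test, each again with X and y elements.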
