How to split image dataset to train and test with R script
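I have a folder containing 8000 images. I would like to know how to split these images into a training set and a test set in R, for example 70% of the 8000 for training and 30% for testing. I tried: train <- sample(nrow(trainData), 0.7*nrow(trainData), replace = FALSE)
but it does not seem to work for images. I know that in Python we can do: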
`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.15)`
but I do not know how to do this in R.
I extract the images with an extract_feature function:
library(EBImage)  # readImage, resize, channel
library(stringr)  # str_extract
library(pbapply)  # pblapply

extract_feature <- function(dir_path, width, height, labelsExist = TRUE) {
  img_size <- width * height
  ## List images in path
  images_names <- list.files(dir_path)
  if (labelsExist) {
    ## Extract the cat/dog label from the start of each file name
    catdog <- str_extract(images_names, "^(cat|dog)")
    # Set cat == 0 and dog == 1
    key <- c("cat" = 0, "dog" = 1)
    y <- key[catdog]
  }
  print(paste("Start processing", length(images_names), "images"))
  ## This function will resize an image and turn it into greyscale
  feature_list <- pblapply(images_names, function(imgname) {
    ## Read image
    img <- readImage(file.path(dir_path, imgname))
    ## Resize image
    img_resized <- resize(img, w = width, h = height)
    ## Set to grayscale (normalized to max)
    grayimg <- channel(img_resized, "gray")
    ## Get the image as a matrix
    img_matrix <- grayimg@.Data
    ## Coerce to a vector (row-wise)
    img_vector <- as.vector(t(img_matrix))
    return(img_vector)
  })
  ## Bind the list of vectors into a matrix (one image per row)
  feature_matrix <- do.call(rbind, feature_list)
  feature_matrix <- as.data.frame(feature_matrix)
  ## Set names
  names(feature_matrix) <- paste0("pixel", c(1:img_size))
  if (labelsExist) {
    return(list(X = feature_matrix, y = y))
  } else {
    return(feature_matrix)
  }
}
# Takes approx. 15min
# (width and height must be defined before this call)
trainData <- extract_feature("/run/media/dogvscat/dogs-vs-cats/", width, height)
I want to split this trainData into train (70%) and validation (30%).
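The pixel features and the labels are stored in a list, so you have to extract the features and the labels separately. First, use sample to create a vector of random numbers between 1 and the number of images (8000). You then use this vector to pull out the same random pixel vectors and labels, in the same order. The [] operator works not only on the trainData list itself but also on the feature_matrix data frame and on the y vector inside the trainData list.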
# Simulated example:
# 8000 images in rows with 100 pixels per image,
# so images have to be selected row-wise
feature_matrix <- matrix(rnorm(800000), ncol = 100)
feature_matrix <- as.data.frame(feature_matrix)
# Simulated labels
y <- sample(0:1, size = 8000, replace = TRUE)
trainData <- list(X = feature_matrix, y = y)
# Random vector for image selection
# Optional: call set.seed for reproducibility
set.seed(77)
rnd <- sample(1:8000, size = 0.7 * 8000)
# Select images and labels separately, using the same indices for both
train <- list(X = trainData$X[rnd, ], y = trainData$y[rnd])
test <- list(X = trainData$X[-rnd, ], y = trainData$y[-rnd])
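The same idea can be wrapped in a small reusable function. The sketch below is not part of the original answer: split_train_test and its arguments train_frac and seed are hypothetical names, and it assumes the data is a list holding a data frame X and a label vector y, as returned by extract_feature. It also illustrates why the attempt in the question did not work: trainData is a list, so nrow(trainData) returns NULL, and the number of images has to be taken from nrow(trainData$X) instead.

```r
# Hypothetical helper (an illustration, not from the original answer):
# split a list with a feature data frame X and a label vector y
# into a train part and a test part using one shared set of row indices.
split_train_test <- function(data, train_frac = 0.7, seed = NULL) {
  n <- nrow(data$X)
  n_train <- round(train_frac * n)
  stopifnot(length(data$y) == n, n_train >= 1, n_train < n)
  if (!is.null(seed)) set.seed(seed)
  # Draw n_train distinct row indices without replacement
  idx <- sample.int(n, size = n_train)
  list(
    train = list(X = data$X[idx, , drop = FALSE], y = data$y[idx]),
    test  = list(X = data$X[-idx, , drop = FALSE], y = data$y[-idx])
  )
}

# Small simulated data set: 10 "images" with 4 pixels each
set.seed(1)
toy <- list(
  X = as.data.frame(matrix(rnorm(40), ncol = 4)),
  y = sample(0:1, size = 10, replace = TRUE)
)

is.null(nrow(toy))      # TRUE: a plain list has no rows to sample from

parts <- split_train_test(toy, train_frac = 0.7, seed = 77)
nrow(parts$train$X)     # 7
nrow(parts$test$X)      # 3
```

Because the features and the labels are indexed with the same idx vector, each image keeps its own label in both parts. Passing a seed makes the split repeatable across runs.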