Split image dataset into train-test datasets
So I have a main folder which contains sub-folders, which in turn contain the images for the dataset, as below.
-main_db
---CLASS_1
-----img_1
-----img_2
-----img_3
-----img_4
---CLASS_2
-----img_1
-----img_2
-----img_3
-----img_4
---CLASS_3
-----img_1
-----img_2
-----img_3
-----img_4
I need to split this dataset into two parts, i.e. training data (70%) and testing data (30%). Below is the hierarchy I want to achieve:
-main_db
---training_data
-----CLASS_1
-------img_1
-------img_2
-------img_3
-------img_4
-----CLASS_2
-------img_1
-------img_2
-------img_3
-------img_4
---testing_data
-----CLASS_1
-------img_5
-------img_6
-------img_7
-------img_8
-----CLASS_2
-------img_5
-------img_6
-------img_7
-------img_8
Any help is appreciated. Thanks.
I tried this module, but it doesn't work for me. The module is not getting imported at all:
https://github.com/jfilter/split-folders
This is exactly what I want.
This should do it. It counts how many images each class folder contains, splits them accordingly, and saves the test data into a different folder with the same structure.
Save the code in main.py and run the command:
python3 main.py --data_path=/path1 --test_data_path_to_save=/path2 --train_ratio=0.7
import argparse
import os
import shutil

import numpy as np


def get_files_from_folder(path):
    files = os.listdir(path)
    return np.asarray(files)


def main(path_to_data, path_to_test_data, train_ratio):
    # get dirs
    _, dirs, _ = next(os.walk(path_to_data))

    # calculates how many test samples per class
    data_counter_per_class = np.zeros((len(dirs)))
    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))

    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])

        # creates dir
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)

        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)


def parse_args():
    parser = argparse.ArgumentParser(description="Dataset divider")
    parser.add_argument("--data_path", required=True,
                        help="Path to data")
    parser.add_argument("--test_data_path_to_save", required=True,
                        help="Path to test data where to save")
    parser.add_argument("--train_ratio", required=True,
                        help="Train ratio - 0.7 means splitting data in 70%% train and 30%% test")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args.data_path, args.test_data_path_to_save, float(args.train_ratio))
If you are not too keen on coding, there is a python package called split-folders. It is very easy to use and can be found here.
Here is how to use it.
pip install split-folders

import split_folders  # or import splitfolders

input_folder = "/path/to/input/folder"
# where you want the split datasets saved; it will be created if it does not exist
output = "/path/to/output/folder"

# The ratios are in the order train/val/test. You can change them to whatever
# you want; for train/val sets only, you could do e.g. (.75, .25).
split_folders.ratio(input_folder, output=output, seed=42, ratio=(.8, .1, .1))
However, I highly recommend working through the coded answers above, as they will help you learn.
If you look at their documentation here, they have updated the syntax. Basically, I ran into a similar problem, and I found that the following new syntax works as per the update there:
import splitfolders  # or import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`,
# i.e., `(.8, .2)`.
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1),
                   group_prefix=None)  # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To only split into training and validation set, use a single number to `fixed`,
# i.e., `10`.
splitfolders.fixed("input_folder", output="output", seed=1337, fixed=(100, 100),
                   oversample=False, group_prefix=None)  # default values
** Visit this link: https://www.kaggle.com/questions-and-answers/102677. Thanks to "saravanansaminathan"'s comment on Kaggle about the same problem on my dataset, which has the following folder structure:
/TTSplit
/0
/001_01.jpg
…………
/1
/001_04.jpg
…………
I followed the above link as a reference. **
import os
import random
import shutil

import numpy as np

root_dir = '/home/dipak/Desktop/TTSplit/'
classes_dir = ['0', '1']
test_ratio = 0.20

for cls in classes_dir:
    os.makedirs(root_dir + 'train/' + cls)
    os.makedirs(root_dir + 'test/' + cls)

    src = root_dir + cls
    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)
    train_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                               [int(len(allFileNames) * (1 - test_ratio))])
    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]

    print("*****************************")
    print('Total images: ', len(allFileNames))
    print('Training: ', len(train_FileNames))
    print('Testing: ', len(test_FileNames))
    print("*****************************")

    # copy each file into the train/test folder of its own class
    for name in train_FileNames:
        shutil.copy(name, root_dir + 'train/' + cls)
    for name in test_FileNames:
        shutil.copy(name, root_dir + 'test/' + cls)

print("Copying Done!")
import os

from sklearn.model_selection import train_test_split

data = os.listdir(image_directory)  # image_directory is the folder containing your images
train, valid = train_test_split(data, test_size=0.2, random_state=1)

Then you can use shutil to copy the images to the folders you want.
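The shutil step can be sketched as follows. This is a minimal illustration, not part of the answer above: the `main_db/CLASS_1` folder, the file names, and the 70/30 split lists stand in for your own `image_directory` and the `train`/`valid` lists returned by train_test_split.

```python
import os
import shutil

# Hypothetical layout for illustration: one class folder with a few images
image_directory = "main_db/CLASS_1"
os.makedirs(image_directory, exist_ok=True)
for name in ["img_1.jpg", "img_2.jpg", "img_3.jpg"]:
    open(os.path.join(image_directory, name), "w").close()

train = ["img_1.jpg", "img_2.jpg"]  # e.g. the lists returned by train_test_split
valid = ["img_3.jpg"]

# Copy each subset into its own folder, preserving the class subfolder
for subset, names in [("training_data", train), ("testing_data", valid)]:
    dest = os.path.join("main_db", subset, "CLASS_1")
    os.makedirs(dest, exist_ok=True)
    for name in names:
        shutil.copy(os.path.join(image_directory, name), dest)
```

Repeating the loop per class folder reproduces the training_data/testing_data hierarchy from the question.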
What about this?
from pathlib import Path
import shutil

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def image_train_test_split(path, fmt, train_size):
    train_folder = Path('train')
    test_folder = Path('test')
    train_folder.mkdir(exist_ok=True)
    test_folder.mkdir(exist_ok=True)

    # collect (file path, class name) pairs from each class subfolder
    data_path = Path(path)
    data = []
    for d in data_path.glob('*'):
        for f in d.glob(f'*.{fmt}'):
            data.append([f, d.stem])
    data = np.array(data)

    # stratified split keeps the class proportions in both sets
    ss = StratifiedShuffleSplit(1, train_size=train_size)
    train_ix, test_ix = next(ss.split(data[:, 0], data[:, 1]))
    train_set, test_set = data[train_ix], data[test_ix]

    for p, c in train_set:
        p = Path(p)  # np.array stores the paths as strings
        (train_folder / c).mkdir(exist_ok=True)
        shutil.move(p, train_folder.joinpath(*p.parts[-2:]))
    for p, c in test_set:
        p = Path(p)
        (test_folder / c).mkdir(exist_ok=True)
        shutil.move(p, test_folder.joinpath(*p.parts[-2:]))
I needed something similar to @Dipendra Pant's idea, but his code didn't work for me. I think it has some indentation errors in the for loop. Anyway, strongly based on his answer, here is the solution that worked for me: it reads from a folder containing 5 subfolders (my 5 classes) and saves everything into 3 folders (train_ds, test_ds, val_ds), each with 5 subfolders inside, ready to be used with image_dataset_from_directory with shuffle=False (the shuffling is already done in this code).
import os
import random
import shutil

import numpy as np

# base_folder and input_destination must be set to your own input/output paths
root_dir = base_folder + "input/House_Room_Dataset-5_rooms/"
classes_dir = os.listdir(root_dir)
train_ratio = 0.6
val_ratio = 0.1

for cls in classes_dir:
    os.makedirs(input_destination + 'train_ds/' + cls, exist_ok=True)
    os.makedirs(input_destination + 'test_ds/' + cls, exist_ok=True)
    os.makedirs(input_destination + 'val_ds/' + cls, exist_ok=True)

    # for each class, count its elements
    src = root_dir + cls
    allFileNames = os.listdir(src)

    # shuffle it and split into train/test/val
    np.random.shuffle(allFileNames)
    train_FileNames, test_FileNames, val_FileNames = np.split(
        np.array(allFileNames),
        [int(train_ratio * len(allFileNames)), int((1 - val_ratio) * len(allFileNames))])

    # save their initial path
    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]
    val_FileNames = [src + '/' + name for name in val_FileNames.tolist()]

    print("\n *****************************",
          "\n Total images: ", cls, len(allFileNames),
          '\n Training: ', len(train_FileNames),
          '\n Testing: ', len(test_FileNames),
          '\n Validation: ', len(val_FileNames),
          '\n *****************************')

    # copy files from the initial path to the final folders
    for name in train_FileNames:
        shutil.copy(name, input_destination + 'train_ds/' + cls)
    for name in test_FileNames:
        shutil.copy(name, input_destination + 'test_ds/' + cls)
    for name in val_FileNames:
        shutil.copy(name, input_destination + 'val_ds/' + cls)

# checking everything was fine
paths = ['train_ds/', 'test_ds/', 'val_ds/']
for p in paths:
    for dir, subdir, files in os.walk(input_destination + p):
        print(dir, ' ', p, str(len(files)))