Uploading to Activeloop Hub is slow. How can I make Hub dataset uploads faster?

I'm using the Places365 (resized) dataset. It's a classification dataset of about 2.7 million images, totaling 131 GB.

I'm trying to upload this dataset to Hub (the dataset format for AI), and the upload runs at about 5 MB/s. After it finished, I was able to load the dataset, and it contained roughly 2.4 million images.
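As a back-of-envelope check (assuming the 131 GB and 5 MB/s figures above), the full upload at that rate takes on the order of seven hours:

```python
# Rough upload-time estimate at the observed throughput.
total_gb = 131      # dataset size reported above
rate_mb_s = 5       # observed upload speed

total_mb = total_gb * 1024
seconds = total_mb / rate_mb_s
hours = seconds / 3600
print(f"{hours:.1f} hours")  # roughly 7.5 hours
```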

Is it possible to speed up the upload process?

I'm uploading the dataset with the following code:

import hub
import numpy as np
from PIL import Image
import argparse
import tqdm
import time

import logging

import torchvision.datasets as datasets

NUM_WORKERS = 1
DS_OUT_PATH = "./data/places365"  # optionally s3://, gcs:// or hub:// path
DOWNLOAD = False
splits = [
    "train-standard",
    # "val",
    # "train-challenge"
]

parser = argparse.ArgumentParser(description="Hub Places365 Uploading")
parser.add_argument("data", metavar="DIR", help="path to dataset")
parser.add_argument(
    "--num_workers",
    type=int,
    default=NUM_WORKERS,
    metavar="O",
    help="number of workers to allocate",
)
parser.add_argument(
    "--ds_out",
    type=str,
    default=DS_OUT_PATH,
    metavar="O",
    help="dataset path to be transformed into",
)

parser.add_argument(
    "--download",
    action="store_true",  # type=bool would treat any non-empty string as True
    default=DOWNLOAD,
    help="Download from the source http://places2.csail.mit.edu/download.html",
)

args = parser.parse_args()


def define_dataset(path: str, class_names: list = []):
    ds = hub.empty(path, overwrite=True)

    ds.create_tensor("images", htype="image", sample_compression="jpg")
    ds.create_tensor("labels", htype="class_label", class_names=class_names)

    return ds


@hub.compute
def upload_parallel(pair_in, sample_out):
    filepath, target = pair_in[0], pair_in[1]
    try:
        img = Image.open(filepath)
        # img.size is always (width, height); check the mode instead
        # to catch grayscale/palette images
        if img.mode != "RGB":
            img = img.convert("RGB")
        arr = np.asarray(img)
        sample_out.images.append(arr)
        sample_out.labels.append(target)
    except Exception as e:
        logging.error(f"failed uploading {filepath} with target {target}: {e}")


def upload_iteration(filenames_target: list, ds: hub.Dataset):
    with ds:
        for filepath, target in tqdm.tqdm(filenames_target):
            try:
                img = Image.open(filepath)
                # img.size is always (width, height); check the mode instead
                # to catch grayscale/palette images
                if img.mode != "RGB":
                    img = img.convert("RGB")
                arr = np.asarray(img)
                ds.images.append(arr)
                ds.labels.append(target)
            except Exception as e:
                logging.error(f"failed uploading {filepath} with target {target}: {e}")


if __name__ == "__main__":

    for split in splits:
        torch_dataset = datasets.Places365(
            args.data,
            split=split,
            download=args.download,
        )
        categories = torch_dataset.load_categories()[0]
        categories = list(map(lambda x: "/".join(x.split("/")[2:]), categories))
        ds = define_dataset(f"{args.ds_out}-{split}", categories)
        filenames_target = torch_dataset.load_file_list()

        print(f"uploading {split}...")
        t1 = time.time()
        if args.num_workers > 1:

            upload_parallel().eval(
                filenames_target[0],
                ds,
                num_workers=args.num_workers,
                scheduler="processed",
            )
        else:
            upload_iteration(filenames_target[0], ds)
        t2 = time.time()
        print(f"uploading {split} took {t2-t1}s")

I'm using Hub v2.2.2.

The speed you're experiencing is expected. Typically, Hub uploads datasets at ~10-15 MB/s single-threaded. It looks like you're running at ~5 MB/s, which is roughly in the same ballpark. If you want to run multi-threaded, take a look at the approach in the upload_parallel function in the script on the Places365 GitHub example page. It uses multiprocessing to speed things up.
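The multiprocessing idea can be sketched with just the standard library. Here `decode_one` is a hypothetical stand-in for the PIL decode step in your upload_parallel function; the real script would append the decoded arrays to the Hub tensors instead of returning them:

```python
# Sketch: fan the (filepath, label) list out across worker processes
# so image decoding runs in parallel instead of serially.
from concurrent.futures import ProcessPoolExecutor

def decode_one(pair):
    # Hypothetical stand-in for Image.open(filepath).convert("RGB")
    filepath, target = pair
    return (filepath, target)

def decode_parallel(pairs, num_workers=2):
    # chunksize batches work items to cut inter-process overhead;
    # map preserves the input order in its results
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(decode_one, pairs, chunksize=32))

pairs = [(f"img_{i}.jpg", i % 365) for i in range(200)]
out = decode_parallel(pairs)
```

With real image decoding in `decode_one`, the per-image CPU cost (open, convert, re-encode) is what parallelizes; the network upload itself is still bounded by your bandwidth.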

By the way, you can also visualize Places365 on Activeloop Platform.