Generate CSV import file for AutoML Vision from an existing bucket

I already have a GCloud bucket organized by label, as follows:

gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...

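Each label folder contains photos. I would like to generate the CSV that AutoML Vision requires for import, but I don't know how to do this programmatically, given that I have hundreds of photos in each folder. The CSV file should look like this: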

gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...

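You need to list every file inside the dataset folder with its full path, then parse each path to get the name of the folder that contains the file, since in your case that folder name is the label you want to use. This can be done in several ways. I'll include two examples that you can base your code on: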

gsutil has a command that lists bucket contents (gsutil ls); you can then parse the resulting strings with a bash script:

# Create the CSV file and define the bucket path
bucket_path="gs://buckbuckbuckbuck/dataset/"
filename="labels_csv_bash.csv"
: > "$filename" # Create the file, or empty it if it already exists

IFS=$'\n' # Split the gsutil output on newlines only, so paths containing spaces stay intact

# List every .jpg file under the bucket path. ** matches recursively.
for i in $(gsutil ls "${bucket_path}**.jpg")
do
        # Split the path on / and take the second field from the end (the parent folder).
        label=$(echo "$i" | rev | cut -d'/' -f2 | rev)
        # No space after the comma, so the label matches the folder name exactly
        echo "$i,$label" >> "$filename"
done

unset IFS # Restore the default field separator (space, tab, newline)

gsutil cp "$filename" "$bucket_path"
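
The rev | cut | rev pipeline above simply picks the second-to-last component of each path. As a minimal sketch of that same step in Python (the function name label_from_uri is hypothetical, chosen for illustration), assuming every photo sits directly inside its label folder:

```python
def label_from_uri(uri: str) -> str:
    """Return the name of the folder that directly contains the object."""
    parts = uri.split("/")
    if len(parts) < 2:
        raise ValueError(f"no parent folder in {uri!r}")
    # The last component is the file name; the one before it is the label.
    return parts[-2]


uri = "gs://my_bucket/dataset/label1/photo1.jpg"
print(f"{uri},{label_from_uri(uri)}")
# prints: gs://my_bucket/dataset/label1/photo1.jpg,label1
```

If photos were nested more deeply than one level below the label folder, this would return the innermost folder instead, so the layout assumption matters.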

This can also be done with the Google Cloud Client Libraries, which are available for several languages. Here is an example using Python:

# Imports the Google Cloud client library
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# The name of the existing bucket and the folder that holds the label folders
bucket_name = 'my_bucket'
path_in_bucket = 'dataset'

# The trailing slash keeps sibling prefixes such as 'dataset2/' out of the listing
blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket + '/')

# Reading blobs, parsing information and creating the csv file
filename = 'labels_csv_python.csv'
with open(filename, 'w') as f:
    for blob in blobs:
        if blob.name.endswith('.jpg'):
            # Build the URI with '/' directly; os.path.join would use '\' on Windows
            bucket_path = f'gs://{bucket_name}/{blob.name}'
            # The parent folder of the object is the label
            label = blob.name.split('/')[-2]
            # No space after the comma, matching the expected CSV format
            f.write(f'{bucket_path},{label}\n')

# Uploading csv file to the bucket
bucket = storage_client.get_bucket(bucket_name)
destination_blob_name = f'{path_in_bucket}/{filename}'
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(filename)
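
With hundreds of photos per folder, it can be worth checking the generated file before importing it. The following is only a sketch of such a check, not part of any Google tooling: count_labels is a hypothetical helper that takes the CSV lines, verifies that each row has a gs:// path and a label equal to the parent folder, and counts the rows per label.

```python
import csv
from collections import Counter


def count_labels(lines):
    """Count rows per label; raise ValueError on the first malformed row."""
    counts = Counter()
    # skipinitialspace tolerates rows written as "path, label"
    reader = csv.reader(lines, skipinitialspace=True)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
        uri, label = row
        if not uri.startswith("gs://"):
            raise ValueError(f"line {line_number}: not a gs:// path: {uri!r}")
        if uri.split("/")[-2] != label:
            raise ValueError(f"line {line_number}: label {label!r} is not the parent folder")
        counts[label] += 1
    return counts


sample = [
    "gs://my_bucket/dataset/label1/photo1.jpg,label1",
    "gs://my_bucket/dataset/label1/photo12.jpg,label1",
    "gs://my_bucket/dataset/label2/photo7.jpg,label2",
]
print(dict(count_labels(sample)))
# prints: {'label1': 2, 'label2': 1}
```

For the real file, pass an open file object instead of a list, for example `with open('labels_csv_python.csv', newline='') as f: count_labels(f)`.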

For those who, like me, are looking for a way to create a .csv file for batch processing in Google AutoML but don't need the label column:

# Create the CSV file and define the bucket path (note the gs:// scheme and the trailing slash)
bucket_path="gs://YOUR_BUCKET/FOLDER/"
filename="THE_FILENAME_YOU_WANT.csv"
: > "$filename" # Create the file, or empty it if it already exists

IFS=$'\n' # Split the gsutil output on newlines only, so paths containing spaces stay intact

# List every file with the chosen extension under the bucket path.
# Change .png in the next line to your own extension. ** matches recursively.
for i in $(gsutil ls "${bucket_path}**.png")
do
       echo "$i" >> "$filename"
done

unset IFS # Restore the default field separator (space, tab, newline)

gsutil cp "$filename" "$bucket_path"
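
The same label-free listing can be sketched in Python. The helper below (batch_rows is a hypothetical name) only does the filtering and formatting, so it runs without credentials; the object names would come from something like `blob.name for blob in storage_client.list_blobs(...)`, as in the Python answer above. Like the gsutil pattern, the extension match is case-sensitive.

```python
def batch_rows(bucket_name, blob_names, extension=".png"):
    """Yield one gs:// URI per object whose name ends with the extension."""
    for name in blob_names:
        if name.endswith(extension):
            yield f"gs://{bucket_name}/{name}"


# Placeholder names for illustration only
names = ["FOLDER/a.png", "FOLDER/sub/b.png", "FOLDER/notes.txt"]
print("\n".join(batch_rows("YOUR_BUCKET", names)))
# prints:
# gs://YOUR_BUCKET/FOLDER/a.png
# gs://YOUR_BUCKET/FOLDER/sub/b.png
```

Writing the yielded lines to a file, one per line, gives the single-column CSV that the bash script produces.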