How to copy folders with 'gsutil'?
I've read the documentation for the gsutil cp command but still don't understand how to copy a folder while keeping the same permissions. I tried this command:
gsutil cp gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
but it resulted in an error:
CommandException: No URLs matched
However, when I tried adding a slash at the end of each name, it showed no error:
gsutil cp gs://bucket-name/folder1/folder_to_copy/ gs://bucket-name/folder1/new_folder/
Yet when I check with gsutil ls, there is no new folder in the bucket. What am I doing wrong?
You should use the -r option to copy the folder and its contents recursively:
gsutil cp -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
Note that this only works if folder_to_copy contains files. That's because Cloud Storage doesn't really have "folders" the way one would expect from a typical GUI; rather, it provides the illusion of a hierarchical file tree on top of a "flat" namespace, as explained here. In other words, files inside a folder are just objects with the folder prefix prepended to their names. So when you run gsutil cp, it expects to copy actual objects rather than empty directories, which the CLI cannot make sense of.
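A quick way to see this flat namespace for yourself is a recursive wildcard listing (the bucket and object names below are the same placeholders used above, so your actual output will differ):
gsutil ls gs://bucket-name/folder1/**
Each result is a full object name such as gs://bucket-name/folder1/folder_to_copy/file1.txt; the "folder" is nothing more than a shared name prefix.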
An alternative is to simply use rsync, which tolerates empty folders and synchronizes the contents of the source and destination folders:
gsutil rsync -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
If you also want to preserve the objects' ACLs (permissions), use the -p option:
gsutil rsync -p -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
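Since rsync overwrites objects that differ at the destination (and deletes extra ones if you also pass -d), it can be worth previewing the operation with the -n (dry-run) flag before running it for real:
gsutil rsync -n -p -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder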
To add to @Maxim's answer, you may want to consider using the -m argument when invoking gsutil to allow parallel copying:
gsutil -m cp -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
The -m arg enables parallelism. As the gsutil documentation suggests, -m may not yield better performance over a slow network (i.e., at home). But for bucket-to-bucket copies (machine to machine within a data center), performance can be "significantly improved", to quote the gsutil manual. See below:
-m Causes supported operations (acl ch, acl set, cp, mv, rm, rsync,
and setmeta) to run in parallel. This can significantly improve
performance if you are performing operations on a large number of
files over a reasonably fast network connection.
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads
and processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration
file. You might want to experiment with these values, as the
best values can vary based on a number of factors, including
network speed, number of CPUs, and available memory.
Using the -m option may make your performance worse if you
are using a slower network, such as the typical network speeds
offered by non-business home network plans. It can also make
your performance worse for cases that perform all operations
locally (e.g., gsutil rsync, where both source and destination
URLs are on the local disk), because it can "thrash" your local
disk.
If a download or upload operation using parallel transfer fails
before the entire transfer is complete (e.g. failing after 300 of
1000 files have been transferred), you will need to restart the
entire transfer.
Also, although most commands will normally fail upon encountering
an error when the -m flag is disabled, all commands will
continue to try all operations when -m is enabled with multiple
threads or processes, and the number of failed operations (if any)
will be reported as an exception at the end of the command's
execution.
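The parallel_thread_count and parallel_process_count values mentioned above are set in the boto configuration file (typically ~/.boto). A minimal sketch of the relevant section, with purely illustrative numbers that you would tune for your own network speed and CPU count:
[GSUtil]
parallel_process_count = 4
parallel_thread_count = 10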
Note: at the time of writing, python3.8 seems to cause problems with the -m flag. Use python3.7. More on this in the Github Issue.
For those who don't want to install the whole SDK and use Docker instead, here is the series of commands I used to download a bucket into a Docker volume named googledata.
(Replace gs://assets with your bucket name.)
docker pull google/cloud-sdk:latest
docker run -ti --name gcloud-config google/cloud-sdk gcloud auth login
docker run --rm -ti -v googledata:/tmp --volumes-from gcloud-config google/cloud-sdk gsutil cp -r gs://assets /tmp
See here for the Docker container.
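To sanity-check the result, you can list the contents of the named volume afterwards; this is just a hypothetical check using a throwaway alpine container:
docker run --rm -v googledata:/data alpine ls -R /data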
A lot of effort just to get your data...