Can VMs on Google Compute detect when they've been migrated?

Question

Is there a way to notify an application running on a Google Compute VM when the VM migrates to different hardware?

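I'm a developer of an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best instruction set.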
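To make the startup probe concrete, here is a minimal sketch of that kind of selection in Python. It is not HMMER's actual code (HMMER is written in C): the function names, the preference order, and the `scalar` fallback label are assumptions for illustration. It assumes x86 Linux, where `/proc/cpuinfo` reports a `flags` line.

```python
# Preference order, best first. The names match the flag strings that x86 Linux
# reports in /proc/cpuinfo; the ordering itself is an illustrative assumption.
PREFERRED_SETS = ["avx512bw", "avx2", "avx", "sse4_1", "sse2"]


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_instruction_set(flags, preferred=PREFERRED_SETS):
    """Pick the first (best) entry of `preferred` that the CPU reports."""
    for name in preferred:
        if name in flags:
            return name
    return "scalar"  # hypothetical label for a non-vectorized fallback path
```

On a Linux VM the input would come from `open("/proc/cpuinfo").read()`. A probe like this only runs once at startup, which is why a change of host CPU mid-run would go unnoticed.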

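We've been looking at running our program on Google Compute and other cloud engines. One concern is that if a VM migrates from one physical machine to another while our program is running, the new machine might support different instructions, causing our program to either crash or run more slowly than it could.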

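Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates. That would kill any currently executing programs, but it would at least let users know that they need to restart the program.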

Answer 1

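We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.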

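However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that, for example, your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.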

Answer 2

According to the Live Migration documentation:

Live migration does not change any attributes or properties of the VM itself. The live migration process just transfers a running VM from one host machine to another. All VM properties and attributes remain unchanged, including things like internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, network connections, and so on.

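Google does provide a few controls for setting instance availability policies, which also let you control aspects of live migration. The same documentation describes what to look for to determine when a live migration has taken place.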

Live migrate

By default, standard instances are set to live migrate, where Google Compute Engine automatically migrates your instance away from an infrastructure maintenance event, and your instance remains running during the migration. Your instance might experience a short period of decreased performance, although generally most instances should not notice any difference. This is ideal for instances that require constant uptime, and can tolerate a short period of decreased performance.

When Google Compute Engine migrates your instance, it reports a system event that is published to the list of zone operations. You can review this event by performing a gcloud compute operations list --zones ZONE request or by viewing the list of operations in the Google Cloud Platform Console, or through an API request. The event will appear with the following text:
compute.instances.migrateOnHostMaintenance
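
As a sketch of how that check could be scripted, the following Python filters a list of zone operations for the migration event. It assumes the input is the JSON you get by adding `--format=json` to the `gcloud compute operations list` command quoted above, and that each entry carries the Compute Engine operation fields `operationType`, `insertTime`, and `targetLink`. The function name `find_migrations` is made up for this example.

```python
import json

# Operation type quoted in the documentation above.
MIGRATION_OP = "compute.instances.migrateOnHostMaintenance"


def find_migrations(operations_json):
    """Return (insertTime, targetLink) for each live-migration operation."""
    operations = json.loads(operations_json)
    return [
        (op.get("insertTime"), op.get("targetLink"))
        for op in operations
        if op.get("operationType") == MIGRATION_OP
    ]
```

Note that this only tells you about a migration after the fact, so it is more useful for auditing than for reacting inside the running program.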

In addition, you can detect directly on the VM when a maintenance event is about to happen.

Getting Live Migration Notices

The metadata server provides information about an instance's scheduling options and settings, through the scheduling/ directory and the maintenance-event attribute. You can use these attributes to learn about a virtual machine instance's scheduling options, and use this metadata to notify you when a maintenance event is about to happen through the maintenance-event attribute. By default, all virtual machine instances are set to live migrate so the metadata server will receive maintenance event notices before a VM instance is live migrated. If you opted to have your VM instance terminated during maintenance, then Compute Engine will automatically terminate and optionally restart your VM instance if the automaticRestart attribute is set. To learn more about maintenance events and instance behavior during the events, read about scheduling options and settings.

You can learn when a maintenance event will happen by querying the maintenance-event attribute periodically. The value of this attribute will change 60 seconds before a maintenance event starts, giving your application code a way to trigger any tasks you want to perform prior to a maintenance event, such as backing up data or updating logs. Compute Engine also offers a sample Python script to demonstrate how to check for maintenance event notices.

You can use the maintenance-event attribute with the waiting for updates feature to notify your scripts and applications when a maintenance event is about to start and end. This lets you automate any actions that you might want to run before or after the event. The following Python sample provides an example of how you might implement these two features together.
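
A minimal sketch of combining the `maintenance-event` attribute with waiting for updates is shown here. It is not Google's official sample. It assumes the standard metadata server endpoint, the `wait_for_change` and `last_etag` query parameters, and that the attribute reads `NONE` when no event is pending. The `watch` function, its callbacks, and the `max_polls` parameter are hypothetical names, and error handling (for example, retrying when the metadata server is briefly unavailable) is left out.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(last_etag="0", timeout=None):
    """Block until maintenance-event changes; return (value, etag)."""
    url = f"{METADATA_URL}?wait_for_change=true&last_etag={last_etag}"
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.read().decode().strip()
        return value, response.headers.get("ETag", "0")


def watch(on_start, on_end, fetch=fetch_maintenance_event, max_polls=None):
    """Call on_start(event) when an event is announced, on_end() when it clears.

    `fetch` is injectable so the loop can be exercised without a metadata
    server; `max_polls=None` means run forever.
    """
    etag, previous, polls = "0", "NONE", 0
    while max_polls is None or polls < max_polls:
        value, etag = fetch(etag)
        if value != previous:
            if value == "NONE":
                on_end()          # maintenance finished
            else:
                on_start(value)   # e.g. MIGRATE_ON_HOST_MAINTENANCE
            previous = value
        polls += 1
```

For the use case in the question, `on_end` is where an application could re-run its CPU feature probe, for example `watch(on_start=print, on_end=reprobe_cpu)`, where `reprobe_cpu` is a function you would supply.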

You can also choose to terminate and optionally restart your instance.

Terminate and (optionally) restart

If you do not want your instance to live migrate, you can choose to terminate and optionally restart your instance. With this option, Google Compute Engine will signal your instance to shut down, wait for a short period of time for your instance to shut down cleanly, terminate the instance, and restart it away from the maintenance event. This option is ideal for instances that demand constant, maximum performance, and your overall application is built to handle instance failures or reboots.

See the Setting availability policies section for more details on how to configure this.

If you use instances with GPUs or preemptible instances, be aware that live migration is not supported:

Live migration and GPUs

Instances with GPUs attached cannot be live migrated. They must be set to terminate and optionally restart. Compute Engine offers a 60 minute notice before a VM instance with a GPU attached is terminated. To learn more about these maintenance event notices, read Getting live migration notices.

To learn more about handling host maintenance with GPUs, read Handling host maintenance on the GPUs documentation.

Live migration for preemptible instances

You cannot configure a preemptible instance to live migrate. The maintenance behavior for preemptible instances is always set to TERMINATE by default, and you cannot change this option. It is also not possible to set the automatic restart option for preemptible instances.

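As Ramesh mentioned, you can specify a minimum CPU platform to ensure that you are only migrated to a host that has at least the minimum CPU platform you specified. At a high level it looks like: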

In summary, when you specify a minimum CPU platform:

Compute Engine always uses the minimum CPU platform where available.

If the minimum CPU platform is not available or the minimum CPU platform is older than the zone default, and a newer CPU platform is available for the same price, Compute Engine uses the newer platform.

If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the server returns a 400 error indicating that the CPU is unavailable.
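
To show where the minimum CPU platform is actually requested, here is a small Python sketch that builds a `gcloud compute instances create` command with the `--min-cpu-platform` flag. The instance name, zone, and helper name `create_instance_command` are hypothetical, and whether "Intel Skylake" is offered in a given zone is an assumption you would need to check.

```python
import shlex


def create_instance_command(name, zone, min_cpu_platform):
    """Build a gcloud argument list that requests a minimum CPU platform."""
    return [
        "gcloud", "compute", "instances", "create", name,
        "--zone", zone,
        "--min-cpu-platform", min_cpu_platform,
    ]


# Hypothetical instance name and zone, for illustration only.
command = create_instance_command("hmmer-worker", "us-central1-b", "Intel Skylake")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(command, check=True)` would execute it. Building the list rather than a single string avoids quoting problems with platform names that contain spaces.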

