如果整个区域都断电,本地 SSD 会发生什么情况?
What happens to Local SSD if the entire zone were to lose power?
如果整个 google 数据中心遭受灾难性断电,本地 SSD 上的数据会怎样?当计算引擎实例最终重新上线时,它是否还会将数据保存在本地 SSD 上?它似乎可以很好地处理计划停机时间:
No planned downtime: Local SSD data will not be lost when Google does
datacenter maintenance, even without replication or redundancy. We
will use our live migration technology to move your VMs along with
their local SSD to a new machine in advance of any planned
maintenance, so your applications are not disrupted and your data is
not lost.
但我担心计划外停机。磁盘故障是一个永远存在的风险,但如果将本地 SSD 与复制相结合,就可以防范这种情况。但是,我正在尝试防止相关故障,例如整个地区都变黑了。然后内存中复制的数据丢失了,但是当实例恢复时,fsynced 到本地 SSD 的数据是否可能存在?如果你丢失了它,那么将数据同步到本地 SSD 真的不会比 RAM 更安全 - 例如对于 运行 本地数据库实例或其他东西。
顺便说一句,请注意 Google data center power supplies are redundant 并备有备用发电机以防相关电源出现故障:
Powering our data centers
To keep things running 24/7 and ensure uninterrupted services, Google’s data centers feature redundant power systems and environmental controls. Every critical component has a primary and alternate power source, each with equal power. Diesel engine backup generators can provide enough emergency electrical power to run each data center at full capacity. Cooling systems maintain a constant operating temperature for servers and other hardware, reducing the risk of service outages. Fire detection and suppression equipment helps prevent damage to hardware. Heat, fire, and smoke detectors trigger audible and visible alarms in the affected zone, at security operations consoles, and at remote monitoring desks.
回到你的问题。你问:
Then the in-memory replicated data is lost, but does the data fsynced to the local SSD likely survive when the instances come back up?
根据local SSD documentation(强调原文):
[...] local SSD storage is not automatically replicated and all data can be lost in the event of an instance reset, host error, or user configuration error that makes the disk unreachable. Users must take extra precautions to back up their data.
如果上述所有保护措施都失效,断电相当于实例重置,可能导致本地SSD卷无法访问——虚拟机很有可能在不同的物理机上重启,如果它确实如此,数据基本上是 "lost",因为它无法访问和擦除。
因此,您应该将本地 SSD 数据视为瞬态,就像您将 RAM 视为瞬态一样。
您还问过:
However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark.
如果要防止区域中断,请跨区域中的多个区域进行复制。如果要防止整个区域中断,请复制到其他区域。如果您想防止相关区域出现故障,请复制到更多区域。
您还可以将数据快照存储在 Google Cloud Storage which provides a high level of durability:
Google Cloud Storage is designed for 99.999999999% durability; multiple copies, multiple locations with checksums and cross region striping of data.
如果整个 google 数据中心遭受灾难性断电,本地 SSD 上的数据会怎样?当计算引擎实例最终重新上线时,它是否还会将数据保存在本地 SSD 上?它似乎可以很好地处理计划停机时间:
No planned downtime: Local SSD data will not be lost when Google does datacenter maintenance, even without replication or redundancy. We will use our live migration technology to move your VMs along with their local SSD to a new machine in advance of any planned maintenance, so your applications are not disrupted and your data is not lost.
但我担心计划外停机。磁盘故障是一个永远存在的风险,但如果将本地 SSD 与复制相结合,就可以防范这种情况。但是,我正在尝试防止相关故障,例如整个地区都变黑了。然后内存中复制的数据丢失了,但是当实例恢复时,fsynced 到本地 SSD 的数据是否可能存在?如果你丢失了它,那么将数据同步到本地 SSD 真的不会比 RAM 更安全 - 例如对于 运行 本地数据库实例或其他东西。
顺便说一句,请注意 Google data center power supplies are redundant 并备有备用发电机以防相关电源出现故障:
Powering our data centers
To keep things running 24/7 and ensure uninterrupted services, Google’s data centers feature redundant power systems and environmental controls. Every critical component has a primary and alternate power source, each with equal power. Diesel engine backup generators can provide enough emergency electrical power to run each data center at full capacity. Cooling systems maintain a constant operating temperature for servers and other hardware, reducing the risk of service outages. Fire detection and suppression equipment helps prevent damage to hardware. Heat, fire, and smoke detectors trigger audible and visible alarms in the affected zone, at security operations consoles, and at remote monitoring desks.
回到你的问题。你问:
Then the in-memory replicated data is lost, but does the data fsynced to the local SSD likely survive when the instances come back up?
根据local SSD documentation(强调原文):
[...] local SSD storage is not automatically replicated and all data can be lost in the event of an instance reset, host error, or user configuration error that makes the disk unreachable. Users must take extra precautions to back up their data.
如果上述所有保护措施都失效,断电相当于实例重置,可能导致本地SSD卷无法访问——虚拟机很有可能在不同的物理机上重启,如果它确实如此,数据基本上是 "lost",因为它无法访问和擦除。
因此,您应该将本地 SSD 数据视为瞬态,就像您将 RAM 视为瞬态一样。
您还问过:
However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark.
如果要防止区域中断,请跨区域中的多个区域进行复制。如果要防止整个区域中断,请复制到其他区域。如果您想防止相关区域出现故障,请复制到更多区域。
您还可以将数据快照存储在 Google Cloud Storage which provides a high level of durability:
Google Cloud Storage is designed for 99.999999999% durability; multiple copies, multiple locations with checksums and cross region striping of data.