Bigtable: Avoiding hotspotting when using timestamps on row keys
The Cloud Bigtable docs on schema design for time series say:
In the vast majority of cases, time-series queries are accessing a given dataset for a given time period. Therefore, make sure that all of the data for a given time period is stored in contiguous rows, unless doing so would cause hotspotting.
Also, here's what they recommend to avoid hotspotting:
If you're storing a cell phone's battery status, and your row key consists of the word "BATTERY" plus a timestamp, the row key will always increase in sequence. Because Cloud Bigtable stores adjacent row keys on the same server node, all writes will focus only on one node until that node is full, at which point writes will move to the next node in the cluster.
The suggested fix is field promotion:
Move fields from the column data into the row key to make writes non-contiguous.
For example:
BATTERY#20150301124501001 --> BATTERY#Corrie#20150301124501001
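Questions:
- Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
- On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?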
- Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
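It depends on what you query. For example, if you want Corrie's battery status from T1 to T2, you can easily construct a row range: [BATTERY#Corrie#T1, BATTERY#Corrie#T2]. However, if you want the battery status of all users, the query has to scan every row with the prefix BATTERY.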
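To make the difference concrete, here is a minimal sketch in plain Python that uses a sorted list as a stand-in for a table ordered by row key. It is not the Bigtable client API; the helper names (`row_key`, `scan_user`, `scan_all_users`) and the sample users are hypothetical, and it assumes fixed-width timestamps so that string order matches time order.

```python
from bisect import bisect_left, bisect_right

def row_key(user, ts):
    # Hypothetical layout following the example: BATTERY#<user>#<timestamp>
    return f"BATTERY#{user}#{ts}"

# Toy stand-in for a table: row keys kept in lexicographic order.
rows = sorted(
    row_key(user, ts)
    for user in ("Alice", "Corrie", "Zed")
    for ts in ("20150301124501001", "20150301124502001", "20150301124503001")
)

def scan_user(rows, user, t1, t2):
    """Inclusive range scan over [BATTERY#user#t1, BATTERY#user#t2]."""
    lo = bisect_left(rows, row_key(user, t1))
    hi = bisect_right(rows, row_key(user, t2))
    return rows[lo:hi]  # one contiguous slice

def scan_all_users(rows, t1, t2):
    """No contiguous range exists: walk the whole BATTERY prefix and filter."""
    prefix = "BATTERY#"
    matches = []
    for key in rows[bisect_left(rows, prefix):]:
        if not key.startswith(prefix):
            break  # left the prefix, nothing more to scan
        ts = key.rsplit("#", 1)[1]
        if t1 <= ts <= t2:
            matches.append(key)
    return matches

corrie = scan_user(rows, "Corrie", "20150301124501001", "20150301124502001")
everyone = scan_all_users(rows, "20150301124501001", "20150301124502001")
```

The per-user query touches only the rows it returns, while the all-users query reads every BATTERY row and discards the ones outside the time window.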
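So the most important queries you have should dictate which fields get promoted into the row key. In addition, high-cardinality fields help more when promoted, because they spread the load across more tablets.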
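The cardinality point can be illustrated with a toy model. The sketch below assumes the sorted key space is cut into equal-sized contiguous tablets, which is a simplification (real tablet splitting also depends on load and size), and counts how many tablets receive the most recent writes. The function name `tablets_touched` and all parameters are hypothetical.

```python
from bisect import bisect_right

def tablets_touched(num_values, num_tablets=8, points_per_value=100, recent=10):
    """Count tablets hit by the latest `recent` writes of every promoted value."""
    def key(v, t):
        # Zero-padded so lexicographic order matches numeric order.
        return f"BATTERY#user{v:04d}#{t:06d}"

    keys = sorted(
        key(v, t) for v in range(num_values) for t in range(points_per_value)
    )
    # Toy assumption: tablets are equal-sized contiguous slices of the key space.
    size = len(keys) // num_tablets
    split_points = [keys[i * size] for i in range(1, num_tablets)]

    recent_keys = (
        key(v, t)
        for v in range(num_values)
        for t in range(points_per_value - recent, points_per_value)
    )
    # A key belongs to the tablet whose split range contains it.
    return len({bisect_right(split_points, k) for k in recent_keys})

one_value = tablets_touched(1)     # behaves like a timestamp-only key
two_values = tablets_touched(2)
many_values = tablets_touched(64)
```

Under these assumptions a single promoted value keeps all recent writes on one tablet, two values reach two tablets, and 64 values reach all eight.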
- On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?
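I'm not quite sure what you mean by "query a range only the timestamp". Could you give an example?
A lot depends on what "TIMESTAMP" means. If you always query the last 10 minutes, then at any given time all of your queries go to a single server, and you will run into hotspotting.
One more thing to note: if your row key is poorly designed, writes will hotspot as well and you won't get good write throughput. Design your row keys to avoid hotspotting.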