Why do small files create hot spots in the Google File System?

I don't understand this part of the Google File System paper:

A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.

What is different about small files? Couldn't many clients accessing a large file cause the same problem?

Here is what I think / have read:

A passage a bit later in the paper helps clarify:

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

If 1000 clients read a small file at the same time, the N chunkservers holding its only chunk each receive 1000/N simultaneous requests. That sudden, concentrated load is the hot spot.
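A back-of-envelope sketch of that arithmetic (the client count, chunk count, and replication factor here are illustrative assumptions, not figures from the paper, except that 3 is GFS's default replication):

```python
# Hypothetical numbers: 1000 clients all fetch a file at the same moment.
clients = 1000

# Small file: one chunk, replicated on 3 chunkservers (GFS default).
small_file_replicas = 3
load_small = clients / small_file_replicas  # requests per chunkserver
print(load_small)  # ~333 simultaneous requests per server: a hot spot

# Large file: say 100 chunks, each replicated 3 ways, so the same 1000
# requests can spread over many more distinct chunkservers (assuming
# chunks land on different servers, which GFS's placement aims for).
large_file_chunks = 100
servers_holding_some_chunk = large_file_chunks * small_file_replicas
load_large = clients / servers_holding_some_chunk
print(load_large)  # ~3 requests per server if load spreads evenly
```

The exact numbers don't matter; the point is that a single-chunk file caps the number of servers that can absorb the burst at its replication factor.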

A large file is not read all at once by a given client (it is large, after all). Instead, the client loads some portion of the file, processes it, and moves on to the next portion.

In a sharded scenario (MapReduce, Hadoop), the workers may not even read the same chunks at all: each of N clients reads a different 1/N of the file's chunks.
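That partitioning can be sketched as follows (a hypothetical splitter, not code from MapReduce or Hadoop): each worker claims a disjoint slice of the chunk index space, so no two workers hammer the same set of chunkservers.

```python
def chunk_range(worker_id, num_workers, total_chunks):
    """Give worker_id a disjoint ~1/num_workers slice of the chunks."""
    per = total_chunks // num_workers
    start = worker_id * per
    # Last worker absorbs the remainder.
    end = start + per if worker_id < num_workers - 1 else total_chunks
    return list(range(start, end))

# 4 workers over a 10-chunk file each get a different slice:
for w in range(4):
    print(w, chunk_range(w, 4, 10))
```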

Even in a non-sharded scenario, clients will not stay perfectly synchronized in practice. They may all end up reading the whole file, but with random access patterns, so statistically no hot spot forms. Or, if they read sequentially, they drift out of sync because their workloads differ (unless you deliberately synchronize the clients... but don't do that).
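A toy simulation of that drift (all parameters here are made up for illustration): five clients read a 50-chunk file in order, but per-chunk processing time varies, so after a while they are reading far-apart chunks rather than all hitting the same one.

```python
import random

random.seed(0)

# 5 clients each read 50 chunks sequentially; per-chunk work time
# varies uniformly in [0.5, 1.5) time units.
num_clients, num_chunks = 5, 50
finish = []  # finish[c][k] = time client c finishes chunk k
for c in range(num_clients):
    t, times = 0.0, []
    for _ in range(num_chunks):
        t += random.uniform(0.5, 1.5)
        times.append(t)
    finish.append(times)

# Which chunk is each client on at time t = 20?
t_probe = 20.0
current_chunk = [sum(1 for ft in times if ft <= t_probe) for times in finish]
print(current_chunk)  # positions scatter instead of piling on one chunk
```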

So larger files see fewer hot spots even with many clients, because of the nature of the work done on large files. This is not a guarantee, which I think is what you were getting at in your question, but in practice distributed clients do not work in lockstep on every chunk of a multi-chunk file.