Why do small files create hot spots in the Google File System?

I don't understand this part of the Google File System paper:

A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.

What is different about small files? Couldn't many clients accessing a large file cause the same problem?

Here is what I think / have read:

A passage a bit later in the paper helps clarify:

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

If 1000 clients read a small file at the same time, the N chunkservers holding its only chunk each receive 1000/N simultaneous requests. That sudden, concentrated load is the hot spot.
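A back-of-envelope sketch of that arithmetic (the client count, chunk count, and replication factor here are illustrative assumptions, not figures from the paper, except that 3 is GFS's default replication):

```python
# Hypothetical numbers: 1000 clients all fetch a file at the same moment.
clients = 1000

# Small file: one chunk, replicated on 3 chunkservers (GFS default).
small_file_replicas = 3
load_small = clients / small_file_replicas  # requests per chunkserver
print(load_small)  # ~333 simultaneous requests per server: a hot spot

# Large file: say 100 chunks, each replicated 3 ways, so the same 1000
# requests can spread over many more distinct chunkservers (assuming
# chunks land on different servers, which GFS's placement aims for).
large_file_chunks = 100
servers_holding_some_chunk = large_file_chunks * small_file_replicas
load_large = clients / servers_holding_some_chunk
print(load_large)  # ~3 requests per server if load spreads evenly
```

The exact numbers don't matter; the point is that a single-chunk file caps the number of servers that can absorb the burst at its replication factor.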

A large file is not read all at once by a given client (it is large, after all). Instead, the client loads some portion of the file, processes it, and moves on to the next portion.

In a sharded scenario (MapReduce, Hadoop), the workers may not even read the same chunks at all: each of N clients reads a different 1/N of the file's chunks.
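That partitioning can be sketched as follows (a hypothetical splitter, not code from MapReduce or Hadoop): each worker claims a disjoint slice of the chunk index space, so no two workers hammer the same set of chunkservers.

```python
def chunk_range(worker_id, num_workers, total_chunks):
    """Give worker_id a disjoint ~1/num_workers slice of the chunks."""
    per = total_chunks // num_workers
    start = worker_id * per
    # Last worker absorbs the remainder.
    end = start + per if worker_id < num_workers - 1 else total_chunks
    return list(range(start, end))

# 4 workers over a 10-chunk file each get a different slice:
for w in range(4):
    print(w, chunk_range(w, 4, 10))
```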

Even in a non-sharded scenario, clients will not stay perfectly synchronized in practice. They may all end up reading the whole file, but with random access patterns, so statistically no hot spot forms. Or, if they read sequentially, they drift out of sync because their workloads differ (unless you deliberately synchronize the clients... but don't do that).
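A toy simulation of that drift (all parameters here are made up for illustration): five clients read a 50-chunk file in order, but per-chunk processing time varies, so after a while they are reading far-apart chunks rather than all hitting the same one.

```python
import random

random.seed(0)

# 5 clients each read 50 chunks sequentially; per-chunk work time
# varies uniformly in [0.5, 1.5) time units.
num_clients, num_chunks = 5, 50
finish = []  # finish[c][k] = time client c finishes chunk k
for c in range(num_clients):
    t, times = 0.0, []
    for _ in range(num_chunks):
        t += random.uniform(0.5, 1.5)
        times.append(t)
    finish.append(times)

# Which chunk is each client on at time t = 20?
t_probe = 20.0
current_chunk = [sum(1 for ft in times if ft <= t_probe) for times in finish]
print(current_chunk)  # positions scatter instead of piling on one chunk
```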

So larger files see fewer hot spots even with many clients, because of the nature of the work done on large files. This is not a guarantee, which I think is what you were getting at in your question, but in practice distributed clients do not work in lockstep on every chunk of a multi-chunk file.