Why do small files create hot spots in the Google File System?
I don't understand this passage from the Google File System paper:
A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients
are accessing the same file.
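- I assume (correct me if I'm wrong) that the chunks of a large file are stored on different chunkservers, which distributes the load. In that scenario, say 1000 clients each access 1/100 of the file from every chunkserver. Each chunkserver then inevitably ends up receiving 1000 requests. (Isn't that the same as 1000 clients accessing a single small file? The server gets 1000 requests either way, whether for the small file or for parts of a larger file.)
- I read a little about sparse files. According to the paper, a small file fills one chunk or a few chunks. So as far as I understand, small files are not reconstructed, and I have therefore ruled that out as a probable cause of the hot spots.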
However, hot spots did develop when GFS was first used
by a batch-queue system: an executable was written to GFS
as a single-chunk file and then started on hundreds of machines
at the same time. The few chunkservers storing this
executable were overloaded by hundreds of simultaneous requests.
We fixed this problem by storing such executables
with a higher replication factor and by making the batch-queue
system stagger application start times. A potential
long-term solution is to allow clients to read data from other
clients in such situations.
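To put rough numbers on the two fixes in that passage, here is a minimal sketch. The function name, the figure of 300 machines, the replica count of 10, and the 5 start-time waves are all made up for illustration; the only value taken from the paper is the default of three replicas per chunk. It assumes reads are spread evenly across replicas, which a real cluster would only approximate.

```python
import math

def peak_requests_per_replica(clients, replicas, waves=1):
    """Worst-case simultaneous requests hitting one replica of a single chunk.

    Assumptions (hypothetical model, not GFS code): clients are split evenly
    into `waves` staggered start groups, and within a wave the reads are
    spread evenly over the replicas.
    """
    clients_per_wave = math.ceil(clients / waves)
    return math.ceil(clients_per_wave / replicas)

default_load = peak_requests_per_replica(300, replicas=3)
more_replicas = peak_requests_per_replica(300, replicas=10)
staggered = peak_requests_per_replica(300, replicas=10, waves=5)
print(default_load, more_replicas, staggered)  # 100 30 6
```

Under these assumptions, raising the replication factor divides the peak load per chunkserver, and staggering start times divides it again.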
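In a sharded scenario (MapReduce, Hadoop), the workers may not even read the same chunks at all; each of the N clients reads a different 1/N of the file's chunks.
So even with many clients, larger files are less prone to hot spots because of the nature of the work that large files involve. This is not a guarantee, which I think is the point you are making in your question, but in practice distributed clients do not work in tandem on every chunk of a multi-chunk file.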
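A small simulation makes the difference concrete. This is only a sketch under simplifying assumptions that are mine, not the paper's: each chunk has a single replica, chunk `i` lives on server `i % num_servers`, and in the sharded case client `c` reads exactly the chunks whose index satisfies `i % num_clients == c`. The function name and the sizes (1000 clients, 100 chunkservers) are hypothetical.

```python
from collections import Counter

def max_chunkserver_load(num_clients, num_chunks, num_servers, sharded):
    """Return the largest number of chunk reads served by any one chunkserver.

    Illustrative model only: chunk i is stored on server i % num_servers.
    If sharded is True, client c reads only chunks with i % num_clients == c;
    otherwise every client reads every chunk.
    """
    load = Counter()
    for client in range(num_clients):
        for chunk in range(num_chunks):
            if sharded and chunk % num_clients != client:
                continue  # this chunk belongs to another client's shard
            load[chunk % num_servers] += 1
    return max(load.values())

# 1000 clients all reading the same single-chunk file
small_file = max_chunkserver_load(1000, 1, 100, sharded=False)
# 1000 clients each reading their own 1/1000 of a 1000-chunk file
large_sharded = max_chunkserver_load(1000, 1000, 100, sharded=True)
print(small_file, large_sharded)  # 1000 10
```

In this model the single-chunk file puts all 1000 reads on one chunkserver, while the sharded large file leaves each chunkserver with 10. If every client read every chunk of the large file, as in the question's scenario, the load would no longer be low, which is why the access pattern matters more than the file size itself.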