How a gzip file gets stored in HDFS

Question

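HDFS can store files in compressed formats. I know that gzip compression does not support splitting. Suppose the file is a gzip-compressed file whose compressed size is 1 GB. My question is:

How will this file be stored in HDFS (block size is 64 MB)?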

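From this link I learned that the gzip format uses DEFLATE to store the compressed data, and that DEFLATE stores data as a series of compressed blocks.

But I could not fully understand it, and I am looking for a broader explanation.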

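More doubts about the gzip-compressed file:

How many blocks will there be for this 1 GB gzip-compressed file?
Will it be spread over multiple Datanodes?
How will the replication factor be applied to this file (the Hadoop cluster replication factor is 3)?
What is the DEFLATE algorithm?
Which algorithm is used when reading a gzip-compressed file?

I am looking for a broad and detailed explanation.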

Answer 1

How will this file get stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip file format?

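All DFS blocks will be stored on a single Datanode. If your block size is 64 MB and the file is 1 GB, a Datanode with 16 DFS blocks available will store the 1 GB file (1 GB / 64 MB = 15.625 when 1 GB is taken as 1000 MB, rounded up to 16; with 1 GB = 1024 MB it is exactly 16).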

How many blocks will there be for this 1 GB gzip-compressed file?

1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks
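The block count above is a ceiling division of file size by block size. A minimal sketch of that arithmetic follows; the function name `hdfs_block_layout` is hypothetical and is not part of any Hadoop API. It also reports the size of the last block, since the final block of a file only needs to hold the leftover bytes.

```python
MB = 1024 * 1024

def hdfs_block_layout(file_size: int, block_size: int) -> tuple:
    """Return (number_of_blocks, size_of_last_block_in_bytes).

    Hypothetical helper for illustration; it only mirrors the
    "file size / block size, rounded up" arithmetic from the answer.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    if file_size == 0:
        return (0, 0)
    full_blocks, remainder = divmod(file_size, block_size)
    if remainder == 0:
        return (full_blocks, block_size)
    # One extra, partially filled block holds the leftover bytes.
    return (full_blocks + 1, remainder)

# 1 GB taken as 1024 MB, with 64 MB blocks
print(hdfs_block_layout(1024 * MB, 64 * MB))      # (16, 67108864)
# 1 GB taken as 10**9 bytes, with 64 * 10**6 byte blocks (the 15.625 case)
print(hdfs_block_layout(10**9, 64 * 10**6))       # (16, 40000000)
```

Either reading of "1 GB" gives 16 blocks; they differ only in whether the last block is full.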

How will the replication factor be applied to this file (Hadoop cluster replication factor is 3)?

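The same as for any other file. If the file is splittable, nothing changes. If the file is not splittable, Datanodes with the required number of blocks available will be identified. In this case, that means 3 Datanodes with 16 available DFS blocks each.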
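To put numbers on the replication answer, here is a small sketch under the same assumptions as the question (1 GB file, 64 MB blocks, replication factor 3). The function name `replicated_footprint` is made up for this example and does not come from Hadoop.

```python
MB = 1024 * 1024

def replicated_footprint(file_size: int, block_size: int, replication: int) -> tuple:
    """Return (total_block_replicas, total_raw_bytes) for one file.

    Hypothetical helper: every block is stored `replication` times, and a
    block replica occupies only as many bytes as the block actually holds.
    """
    blocks = -(-file_size // block_size)  # ceiling division
    return (blocks * replication, file_size * replication)

replicas, raw_bytes = replicated_footprint(1024 * MB, 64 * MB, 3)
print(replicas)           # 48
print(raw_bytes // MB)    # 3072
```

So the cluster holds 48 block replicas in total, or 16 per Datanode if exactly three Datanodes are used as the answer describes.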

Source code from this link: http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java/?v=source

and

http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.22.0/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java/?v=source

/** The class is responsible for choosing the desired number of targets
 * for placing block replicas.
 * The replica placement strategy is that if the writer is on a datanode,
 * the 1st replica is placed on the local machine, 
 * otherwise a random datanode. The 2nd replica is placed on a datanode
 * that is on a different rack. The 3rd replica is placed on a datanode
 * which is on the same rack as the first replica.
 */
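The placement rule in the quoted comment can be sketched as a toy simulation. This is not Hadoop code: the function `choose_targets`, the topology dictionary, and the node and rack names are all invented for illustration, and the sketch assumes at least two racks and at least two Datanodes on the first replica's rack.

```python
import random

def choose_targets(topology, writer=None, rng=None):
    """Pick three Datanodes for one block, following the quoted comment.

    topology: dict mapping rack name -> list of Datanode names (hypothetical).
    writer:   name of the writing node, or None if the client is off-cluster.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # 1st replica: the local machine if the writer is a Datanode, else random.
    first = writer if writer in rack_of else rng.choice(list(rack_of))

    # 2nd replica: a Datanode on a different rack than the first.
    other_racks = [n for n in rack_of if rack_of[n] != rack_of[first]]
    second = rng.choice(other_racks)

    # 3rd replica: another Datanode on the same rack as the first.
    same_rack = [n for n in topology[rack_of[first]] if n != first]
    third = rng.choice(same_rack)

    return [first, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(choose_targets(topology, writer="dn1", rng=random.Random(0)))
```

With `writer="dn1"`, the first target is always `dn1`, the second is one of `dn3` or `dn4`, and the third is `dn2`. The choice is made per block, so it would be repeated for each of the 16 blocks of the file.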

What is the DEFLATE algorithm?

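DEFLATE is the compression algorithm (a combination of LZ77 and Huffman coding) behind the GZIP format. A gzip file is a DEFLATE stream wrapped in a small header and trailer, and reading it means running the inverse operation, usually called inflate.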
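As a rough illustration using only Python's standard `gzip` and `zlib` modules (nothing Hadoop-specific), the sketch below compresses a payload to gzip, inflates it again, and then tries to start decoding at a non-zero offset. The helper names `inflate_gzip` and `readable_from` are made up for this example. The failed second read is one way to see why a gzip file cannot be split: a reader has to start at the beginning of the stream.

```python
import gzip
import zlib

def inflate_gzip(blob: bytes) -> bytes:
    """Decompress a gzip blob; wbits=31 makes zlib expect the gzip wrapper."""
    return zlib.decompress(blob, wbits=31)

def readable_from(blob: bytes, offset: int) -> bool:
    """Return True if decoding can start at `offset` within the gzip blob."""
    try:
        zlib.decompress(blob[offset:], wbits=31)
        return True
    except zlib.error:
        # No gzip header at this position, so the decoder cannot get started.
        return False

payload = b"hdfs block " * 10000
blob = gzip.compress(payload)

print(blob[:2] == b"\x1f\x8b")         # gzip magic bytes at the start
print(inflate_gzip(blob) == payload)   # round trip restores the payload
print(readable_from(blob, 0))          # decoding from the start works
print(readable_from(blob, 1))          # decoding from offset 1 does not
```

The same applies to an offset that falls on an HDFS block boundary: the bytes there are the middle of a DEFLATE stream, not the start of one, so a single reader normally has to process the whole file from its first byte.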

Have a look at this slide to understand the other algorithms used by different zip file variants.

Have a look at this presentation for more details.

compression

algorithm

gzip

hadoop

hdfs