Persisting unstructured data to hadoop using spark streaming

Question

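I have an ingest pipeline built with Spark Streaming, and I want to store the RDDs in Hadoop as large unstructured (JSONL) data files to simplify future analysis.

What is the best way to persist a stream to Hadoop without ending up with a large number of small files? (Hadoop handles small files poorly, and they complicate the analysis workflow.)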

Answer 1

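First, I would suggest using a persistence layer that can handle this kind of workload, such as Cassandra. But if you are dead set on HDFS, then the mailing list has an answer already:

You can use the FileUtil.copyMerge API (from hadoop fs) and give it the path of the folder where saveAsTextFiles writes its part text files. Supposing your directory is /a/b/c/, use: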

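FileUtil.copyMerge(
    srcFs, new Path("/a/b/c"),      // FileSystem and directory of the source
    dstFs, new Path("/a/b/c.txt"),  // FileSystem and path of the merged file
    true,                           // delete the original directory afterwards
    conf, null)                     // Hadoop Configuration; no separator string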
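
To make the merge step concrete, here is a minimal sketch of the same idea on the local filesystem, using only the Java standard library. It is not Hadoop code and not how copyMerge is implemented; PartFileMerger and mergeParts are hypothetical names, and the sketch assumes a flat output directory whose data files are named part-*, as saveAsTextFiles produces.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: a local-filesystem analogue of the merge step, not Hadoop code.
class PartFileMerger {

    // Concatenates every "part-*" file in srcDir, in name order, into dstFile.
    // If deleteSource is true, all files in srcDir and srcDir itself are removed.
    // Assumes srcDir is flat and dstFile lives outside srcDir.
    static void mergeParts(Path srcDir, Path dstFile, boolean deleteSource) throws IOException {
        List<Path> parts;
        try (Stream<Path> entries = Files.list(srcDir)) {
            parts = entries
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        // Default options create the file or truncate an existing one.
        try (OutputStream out = Files.newOutputStream(dstFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the merged file
            }
        }
        if (deleteSource) {
            List<Path> leftovers;
            try (Stream<Path> entries = Files.list(srcDir)) {
                leftovers = entries.collect(Collectors.toList());
            }
            for (Path p : leftovers) {
                Files.delete(p); // part files plus markers such as _SUCCESS
            }
            Files.delete(srcDir);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartFileMerger <sourceDir> <mergedFile>");
            return;
        }
        mergeParts(Paths.get(args[0]), Paths.get(args[1]), false);
    }
}
```

Sorting by name keeps the records in partition order, since part files are numbered part-00000, part-00001, and so on. Against HDFS, the same loop would go through the Hadoop FileSystem API rather than java.nio.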

hadoop

hdfs

apache-spark

spark-streaming