Spark HBase/BigTable - Wide/sparse dataframe persistence

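I want to persist a very wide Spark DataFrame (>100,000 columns) to BigTable. The DataFrame is sparsely populated (>99% of the values are null), and I want to store only the non-null values (to avoid storage costs).

Is there a way to tell Spark to ignore nulls when writing?

Thanks!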

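Possibly (untested): before writing the Spark DataFrame to HBase/BigTable, you could transform it with a custom function that filters out, for each row, the columns holding null values, as suggested here in an example using pandas: . However, as far as I know, no built-in connector supports this.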
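As a rough illustration of that per-row filtering, here is a minimal sketch in plain Python. It is not a connector API: the function name `non_null_cells`, the key column `id`, and the column family `cf` are all hypothetical, and values are simply stringified before encoding.

```python
from typing import Any, Dict, Iterator, Tuple

# (row key, column family, column qualifier, value), all as bytes
Cell = Tuple[bytes, bytes, bytes, bytes]


def non_null_cells(
    row: Dict[str, Any],
    key_column: str = "id",   # hypothetical name of the row-key column
    family: str = "cf",       # hypothetical column family
) -> Iterator[Cell]:
    """Yield one cell per non-null column of a row, skipping the key column."""
    row_key = str(row[key_column]).encode("utf-8")
    for name, value in row.items():
        if name == key_column or value is None:
            continue  # null columns produce no cell, so nothing is stored for them
        yield (
            row_key,
            family.encode("utf-8"),
            name.encode("utf-8"),
            str(value).encode("utf-8"),
        )


# A sparse row: only c2 carries a value.
example = {"id": 42, "c1": None, "c2": 3.5, "c3": None}
cells = list(non_null_cells(example))
print(cells)
```

In PySpark, something like `df.rdd.flatMap(lambda r: non_null_cells(r.asDict()))` should produce one record per non-null cell. Handing those cells to an HBase/BigTable client is a separate step and is not shown here.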

Alternatively, you could try storing the data in a columnar file format such as Parquet, since such formats handle persistence of sparse columnar data efficiently (at least in terms of output size in bytes). However, to avoid writing many small files (a consequence of the sparse data), which can reduce write throughput, you will probably need to reduce the number of output partitions before writing (i.e. write more rows per Parquet file: ).
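A minimal sketch of the partition-reduction idea, assuming PySpark. The helper names `target_file_count` and `write_compacted` and the 128 MiB target file size are illustrative choices, not recommendations from the answer. The caller has to estimate the output size.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output files needed so each is roughly target_file_bytes (at least 1)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(df, path: str, num_files: int) -> None:
    """Write a Spark DataFrame as Parquet using at most num_files partitions.

    coalesce() merges existing partitions without a full shuffle; it never
    increases the partition count, so num_files acts as an upper bound.
    """
    df.coalesce(num_files).write.mode("overwrite").parquet(path)


# An estimated 1 GB of compressed output at ~128 MiB per file.
print(target_file_count(1_000_000_000))
```

With a DataFrame `df` in hand, a call such as `write_compacted(df, "/some/output/path", target_file_count(estimated_bytes))` would then write fewer, larger files. The path here is a placeholder.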

hbase

sparse-matrix

apache-spark

google-cloud-bigtable

google-cloud-dataproc