Spark writing output as fixed width

Question

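Reading a fixed-width file into Spark is easy, and there are multiple ways to do it. However, I could not find a way to write fixed-width output from Spark (2.3.1). Would converting the DF to an RDD help? I am currently using PySpark, but answers in any language are welcome. Can someone suggest a way out?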

Answer 1

Here is an example of the approach I described.

You can use pyspark.sql.functions.format_string() to format each column to a fixed width and then use pyspark.sql.functions.concat() to combine them all into one string.

For example, suppose you had the following DataFrame:

data = [
    (1, "one", "2016-01-01"),
    (2, "two", "2016-02-01"),
    (3, "three", "2016-03-01")
]

df = spark.createDataFrame(data, ["id", "value", "date"])
df.show()
#+---+-----+----------+
#| id|value|      date|
#+---+-----+----------+
#|  1|  one|2016-01-01|
#|  2|  two|2016-02-01|
#|  3|three|2016-03-01|
#+---+-----+----------+

Suppose you wanted to write out the data left-justified with a fixed width of 10:

from pyspark.sql.functions import concat, format_string

fixed_width = 10
ljust = r"%-{width}s".format(width=fixed_width)

df.select(
    concat(*[format_string(ljust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth                    |
#+------------------------------+
#|1         one       2016-01-01|
#|2         two       2016-02-01|
#|3         three     2016-03-01|
#+------------------------------+

Here we use the printf-style format %-10s to specify a left-justified width of 10.
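
To see what the format specifier does in isolation, here is a small plain-Python sketch. It assumes that Python's % operator treats %-10s the same way as the Java-style formatter behind format_string(), which should hold for this simple case. The helper name pad and its parameters are made up for illustration. Note that the width is only a minimum: longer values are not cut off unless a precision is added.

```python
def pad(value, width=10, left=True, truncate=False):
    """Hypothetical helper: pad `value` to `width` characters, printf style."""
    flag = "-" if left else ""
    # A precision such as ".10" caps the number of characters written for %s.
    precision = ".{}".format(width) if truncate else ""
    spec = "%{}{}{}s".format(flag, width, precision)
    return spec % value


print(repr(pad("one")))                                 # 'one       '
print(repr(pad("one", left=False)))                     # '       one'
print(repr(pad("a-much-longer-value")))                 # 'a-much-longer-value'
print(repr(pad("a-much-longer-value", truncate=True)))  # 'a-much-lon'
```

If the output has to be strictly fixed width, a specifier such as %-10.10s is one way to both pad and truncate.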

If you instead want to right-justify the strings, remove the negative sign:

rjust = r"%{width}s".format(width=fixed_width)

df.select(
    concat(*[format_string(rjust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth                    |
#+------------------------------+
#|         1       one2016-01-01|
#|         2       two2016-02-01|
#|         3     three2016-03-01|
#+------------------------------+

Now you can write out only the fixedWidth column to your output file.
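
As a rough sketch of that last step, the snippet below builds one format string covering all columns and writes the result with DataFrameWriter.text(), which expects a single string column. The helper names, the per-column widths, and the output path are placeholders of mine, not part of the original answer. The DataFrame passed in is assumed to look like df from the example above.

```python
def build_format(widths, left=True):
    """Hypothetical helper: one printf-style spec per column, e.g. '%-5s%-10s'."""
    flag = "-" if left else ""
    return "".join("%{}{}s".format(flag, w) for w in widths)


def write_fixed_width(df, widths, path):
    """Write `df` as fixed-width text; `widths` holds one width per column."""
    # Imported here so build_format() can be used without a Spark installation.
    from pyspark.sql.functions import format_string

    fmt = build_format(widths)
    # format_string() accepts several columns, so no separate concat() is needed.
    line = format_string(fmt, *df.columns).alias("fixedWidth")
    # text() requires exactly one string column; each row becomes one line.
    df.select(line).write.mode("overwrite").text(path)


# Example call (the path is a placeholder):
# write_fixed_width(df, [5, 10, 10], "/tmp/fixed_width_output")
```

Keep in mind that Spark writes a directory of part files at the given path rather than a single file.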

