如何将 Row 中的结构字段转换为 Spark 中的 avro 记录 Java
How to convert a struct field in a Row to an avro record in Spark Java
我有一个用例,我想将结构字段转换为 Avro 记录。 struct 字段最初映射到 Avro 类型。输入数据是avro文件,struct字段对应输入avro记录中的一个字段。
下面是我想用伪代码实现的。
DataSet<Row> data = loadInput(); // data is of form (foo, bar, myStruct) from avro data.
// do some joins to add more data
data = doJoins(data); // now data is of form (a, b, myStruct)
// transform DataSet<Row> to DataSet<MyType>
DataSet<MyType> myData = data.map(row -> myUDF(row), encoderOfMyType);
// method `myUDF` definition
MyType myUDF(Row row) {
String a = row.getAs("a");
String b = row.getAs("b");
// MyStruct is the generated avro class that corresponds to field myStruct
MyStruct myStruct = convertToAvro(row.getAs("myStruct"));
return generateMyType(a, b, myStruct);
}
我的问题是:如何在上面的伪代码中实现 convertToAvro
方法?
The Avro package provides function to_avro to encode a column as binary in Avro format, and from_avro() to decode Avro binary data into a column. Both functions transform one column to another column, and the input/output SQL data type can be a complex type or a primitive type.
函数 to_avro 替代了 convertToAvro
方法:
import static org.apache.spark.sql.avro.functions.*;
//put the avro schema of the struct column into a string
//in my example I assume that the struct consists of a two fields:
//a long field (s1) and a string field (s2)
String schema = "{\"type\":\"record\",\"name\":\"mystruct\"," +
"\"namespace\":\"topLevelRecord\",\"fields\":[{\"name\":\"s1\"," +
"\"type\":[\"long\",\"null\"]},{\"name\":\"s2\",\"type\":" +
"[\"string\",\"null\"]}]},\"null\"]}";
data = ...
//add an additional column containing the struct as binary column
Dataset<Row> data2 = df.withColumn("to_avro", to_avro(data.col("myStruct"), schema));
df2.printSchema();
df2.show(false);
打印
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
+----+----+----------+----------------------------+
|a |b |mystruct |to_avro |
+----+----+----------+----------------------------+
|foo1|bar1|[1, one] |[00 02 00 06 6F 6E 65] |
|foo2|bar2|[3, three]|[00 06 00 0A 74 68 72 65 65]|
+----+----+----------+----------------------------+
要将 avro 列转换回来,可以使用函数 from_avro:
Dataset<Row> data3 = data2.withColumn("from_avro", from_avro(data2.col("to_avro"), schema));
df3.printSchema();
df3.show();
输出:
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
|-- from_avro: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
+----+----+----------+--------------------+----------+
| a| b| mystruct| to_avro| from_avro|
+----+----+----------+--------------------+----------+
|foo1|bar1| [1, one]|[00 02 00 06 6F 6...| [1, one]|
|foo2|bar2|[3, three]|[00 06 00 0A 74 6...|[3, three]|
+----+----+----------+--------------------+----------+
关于udf的一句话:在问题中,您在udf中执行了到avro格式的转换。我宁愿只在 udf 中包含实际的业务逻辑,而将格式转换放在外面。这将逻辑和格式转换分开。如有必要,您可以在创建 avro 列后删除原始列 mystruct
。
我有一个用例,我想将结构字段转换为 Avro 记录。 struct 字段最初映射到 Avro 类型。输入数据是avro文件,struct字段对应输入avro记录中的一个字段。
下面是我想用伪代码实现的。
DataSet<Row> data = loadInput(); // data is of form (foo, bar, myStruct) from avro data.
// do some joins to add more data
data = doJoins(data); // now data is of form (a, b, myStruct)
// transform DataSet<Row> to DataSet<MyType>
DataSet<MyType> myData = data.map(row -> myUDF(row), encoderOfMyType);
// method `myUDF` definition
MyType myUDF(Row row) {
String a = row.getAs("a");
String b = row.getAs("b");
// MyStruct is the generated avro class that corresponds to field myStruct
MyStruct myStruct = convertToAvro(row.getAs("myStruct"));
return generateMyType(a, b, myStruct);
}
我的问题是:如何在上面的伪代码中实现 convertToAvro
方法?
The Avro package provides function to_avro to encode a column as binary in Avro format, and from_avro() to decode Avro binary data into a column. Both functions transform one column to another column, and the input/output SQL data type can be a complex type or a primitive type.
函数 to_avro 替代了 convertToAvro
方法:
import static org.apache.spark.sql.avro.functions.*;
//put the avro schema of the struct column into a string
//in my example I assume that the struct consists of a two fields:
//a long field (s1) and a string field (s2)
String schema = "{\"type\":\"record\",\"name\":\"mystruct\"," +
"\"namespace\":\"topLevelRecord\",\"fields\":[{\"name\":\"s1\"," +
"\"type\":[\"long\",\"null\"]},{\"name\":\"s2\",\"type\":" +
"[\"string\",\"null\"]}]},\"null\"]}";
data = ...
//add an additional column containing the struct as binary column
Dataset<Row> data2 = df.withColumn("to_avro", to_avro(data.col("myStruct"), schema));
df2.printSchema();
df2.show(false);
打印
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
+----+----+----------+----------------------------+
|a |b |mystruct |to_avro |
+----+----+----------+----------------------------+
|foo1|bar1|[1, one] |[00 02 00 06 6F 6E 65] |
|foo2|bar2|[3, three]|[00 06 00 0A 74 68 72 65 65]|
+----+----+----------+----------------------------+
要将 avro 列转换回来,可以使用函数 from_avro:
Dataset<Row> data3 = data2.withColumn("from_avro", from_avro(data2.col("to_avro"), schema));
df3.printSchema();
df3.show();
输出:
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
|-- from_avro: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
+----+----+----------+--------------------+----------+
| a| b| mystruct| to_avro| from_avro|
+----+----+----------+--------------------+----------+
|foo1|bar1| [1, one]|[00 02 00 06 6F 6...| [1, one]|
|foo2|bar2|[3, three]|[00 06 00 0A 74 6...|[3, three]|
+----+----+----------+--------------------+----------+
关于udf的一句话:在问题中,您在udf中执行了到avro格式的转换。我宁愿只在 udf 中包含实际的业务逻辑,而将格式转换放在外面。这将逻辑和格式转换分开。如有必要,您可以在创建 avro 列后删除原始列 mystruct
。