How to add columns successively to a Dataset using Spark within a for loop (where the for loop contains the column names)

Question

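I am trying to add columns one after another to a Dataset<Row>. The problem is that only the last column is visible; the columns added earlier do not persist.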

private static void populate(Dataset<Row> res, String[] args)
    {
        String[] propArr = args[0].split(",");   // Eg: [abc, def, ghi]       
            
        // Dataset<Row> addColToMergedData = null;
        
        /** Here each element is the name of the column to be inserted */
        for(int i = 0; i < propArr.length; i++){

            // addColToMergedData = res.withColumn(propArr[i], lit(null));
        }
    }

Answer 1

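The logic inside the for loop is flawed, which is what causes the problem. A Dataset is immutable, so withColumn returns a new Dataset instead of modifying res. Calling res.withColumn(...) on every iteration and storing the result in a separate variable always starts from the original res, so only the last column survives. You can modify the program as follows: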

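import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.lit;

private static void populate(Dataset<Row> res, String[] args)
{
    String[] propArr = args[0].split(",");   // Eg: [abc, def, ghi]

    /** Here each element is the name of the column to be inserted */
    for (int i = 0; i < propArr.length; i++)
    {
        // withColumn returns a new Dataset, so reassign res to keep every column added so far
        res = res.withColumn(propArr[i], lit(null));
    }

    // Holds all added columns. Reassigning res does not change the caller's reference,
    // so use (or return) addColToMergedData from here on.
    Dataset<Row> addColToMergedData = res;
}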
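
To make the difference between the two loops concrete without needing a Spark session, here is a minimal sketch. FakeDataset is a hypothetical stand-in invented for this illustration, not a Spark class; it only mimics the one property that matters here, namely that withColumn leaves the receiver untouched and returns a new object. The method names addColumnsBuggy and addColumnsFixed are likewise made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Dataset<Row>: immutable, tracks column names only.
final class FakeDataset {
    private final List<String> columns;

    FakeDataset(List<String> columns) {
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }

    // Like Dataset.withColumn, this returns a NEW object; the receiver is unchanged.
    FakeDataset withColumn(String name) {
        List<String> next = new ArrayList<>(columns);
        next.add(name);
        return new FakeDataset(next);
    }

    List<String> columns() {
        return columns;
    }
}

class ImmutableColumnsDemo {
    // Pattern from the question: every iteration starts again from the original res.
    static FakeDataset addColumnsBuggy(FakeDataset res, String[] names) {
        FakeDataset out = res;
        for (String name : names) {
            out = res.withColumn(name);   // earlier additions are discarded
        }
        return out;
    }

    // Pattern from the answer: each iteration builds on the previous result.
    static FakeDataset addColumnsFixed(FakeDataset res, String[] names) {
        FakeDataset out = res;
        for (String name : names) {
            out = out.withColumn(name);   // accumulate columns
        }
        return out;
    }

    public static void main(String[] args) {
        FakeDataset base = new FakeDataset(Arrays.asList("id"));
        String[] names = "abc,def,ghi".split(",");

        System.out.println(addColumnsBuggy(base, names).columns()); // prints [id, ghi]
        System.out.println(addColumnsFixed(base, names).columns()); // prints [id, abc, def, ghi]
    }
}
```

Returning the accumulated object, as both helper methods do here, also avoids relying on the reassigned parameter, which the caller never sees.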

Answer 2

Sun:

// addColToMergedData = res.withColumn(colMap.get(propArr[i]), lit(null));

should be written as:

res = res.withColumn(colMap.get(propArr[i]), lit(null));

so that each iteration builds on the Dataset returned by the previous one.

apache-spark-sql

apache-spark-dataset