Merging column with array from multiple rows

I am trying to merge rows in a dataset that looks like this:

id   sms          longDescription  OtherFields
123  contentSms   ContentDesc      xxx
123  contentSms2  ContentDesc2     xxx
123  contentSms3  ContentDesc3     xxx
456  contentSms4  ContentDesc      xxx

sms and longDescription both have the following structure:

sms:array
|----element:struct
      |----content:string
      |----languageId:string

The goal is to take all rows with the same id and merge the sms and longDescription columns each into a single array of structs (keyed by languageId):

id   sms                                   longDescription                          OtherFields
123  contentSms, contentSms2, contentSms3  ContentDesc, ContentDesc2, ContentDesc3  xxx
456  contentSms4                           ContentDesc                              xxx

I tried using

x = df.groupBy("id").agg(collect_list("sms"))

but the result is:

collect_list(longDescription): array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |    |-- languageId: string (nullable = true)

That is one array level too many. The goal is a single array of structs, so that the result looks like this:

sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]

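You are looking for the flatten function (available since Spark 2.4), which removes one level of array nesting:

from pyspark.sql.functions import collect_list, flatten
x = df.groupBy("id").agg(flatten(collect_list("sms")).alias("sms"))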
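
To make the difference between the two results concrete, here is a plain-Python model of what collect_list and flatten do to the sms column. It is only a sketch of the semantics, not Spark code: the helper names collect_list_by_id and flatten_once are hypothetical, and the row for id 456 uses made-up content ('qsd') for illustration.

```python
from collections import defaultdict
from itertools import chain

# Illustrative rows modelled on the question: (id, sms), where sms is
# already an array of {content, languageId} structs. The 456 row is made up.
rows = [
    (123, [{"content": "aze", "languageId": "en-GB"}]),
    (123, [{"content": "rty", "languageId": "fr-BE"}]),
    (123, [{"content": "poiu", "languageId": "nl-BE"}]),
    (456, [{"content": "qsd", "languageId": "en-GB"}]),
]


def collect_list_by_id(rows):
    """Model of groupBy("id").agg(collect_list("sms")): one entry per row."""
    grouped = defaultdict(list)
    for row_id, sms in rows:
        # sms is itself a list, so appending it produces a list of lists.
        grouped[row_id].append(sms)
    return dict(grouped)


def flatten_once(nested):
    """Model of flatten: remove exactly one level of nesting."""
    return list(chain.from_iterable(nested))


collected = collect_list_by_id(rows)
flattened = {row_id: flatten_once(arrays) for row_id, arrays in collected.items()}

print(collected[123])  # array of arrays of structs (one level too many)
print(flattened[123])  # array of structs (the desired shape)
```

The same pattern should carry over to the second column: passing both flatten(collect_list("sms")).alias("sms") and flatten(collect_list("longDescription")).alias("longDescription") to a single agg call would merge both columns per id. Note that collect_list does not guarantee the order of the collected elements after a shuffle.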