Merging column with array from multiple rows
I am trying to merge data from a dataset that looks like this:
id | sms | longDescription | OtherFields |
---|---|---|---|
123 | contentSms | ContentDesc | xxx |
123 | contentSms2 | ContentDesc2 | xxx |
123 | contentSms3 | ContentDesc3 | xxx |
456 | contentSms4 | ContentDesc | xxx |
Both sms and longDescription have the following structure:
sms:array
|----element:struct
|----|----content:string
|----|----languageId:string
The goal is to take the rows that share the same id and merge their sms and longDescription columns into a single array of structs each (with languageId as the key):
id | sms | longDescription | OtherFields |
---|---|---|---|
123 | contentSms, contentSms2, contentSms3 | ContentDesc, ContentDesc2, ContentDesc3 | xxx |
456 | contentSms4 | ContentDesc | xxx |
I tried using:
x = df.select("*").groupBy("id").agg( collect_list("sms"))
But the result is:
collect_list(longDescription): array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = true)
| | | |-- content: string (nullable = true)
| | | |-- languageId: string (nullable = true)
That is one level of array too many: the goal is a single array of structs, so that the result looks like this:
sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]
You are looking for the flatten function:
from pyspark.sql.functions import collect_list, flatten
x = df.groupBy("id").agg(flatten(collect_list("sms")))