Grouping people by hobbies
I have been trying to solve this problem but have not been able to map it onto any known solution. I have the following dataset:
[
{"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
{"name": "Steve", "hobbies": ["Books", "Swimming"]},
{"name": "Alex", "hobbies": ["Gym", "Music"]}
]
I am trying to generate an output dataset that groups people by the hobbies they share. The output should look like this:
[
{"names": ["sam", "Steve"], "hobbies": ["Books"]},
{"names": ["sam", "Alex"], "hobbies": ["Music", "Gym"]},
{"names": ["Steve"], "hobbies": ["Swimming"]}
]
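It is a large dataset, so I am trying to use Spark.
What I have tried:
Initially I wondered whether this is a graph problem where I could use something like strongly connected components, but that does not seem to solve it.
Each output row looks like a bipartite graph, but I could not find a way to generate those either.
Another approach is clustering, but I believe it is not deterministic. Please correct me if I am wrong; I am not very familiar with it.
Please let me know if I am missing something obvious here. Thanks.
Check the code below.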
scala> df.show(false)
+-------------------+-----+
|hobbies |name |
+-------------------+-----+
|[Books, Music, Gym]|sam |
|[Books, Swimming] |Steve|
|[Gym, Music] |Alex |
+-------------------+-----+
Use groupBy & collect_list:
- Group by hobbies and collect the list of names.
- Group by names and collect the list of hobbies.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df
  .withColumn("hobbies", explode($"hobbies")) // One row per (hobby, name) pair
  .groupBy($"hobbies").agg(collect_list($"name").as("names")) // For each hobby, the list of names
  .groupBy($"names").agg(collect_list($"hobbies").as("hobbies")) // For each name list, the list of hobbies
  .select(collect_list(to_json(struct($"hobbies", $"names"))).as("data")) // Final JSON output
.show(false)
// Exiting paste mode, now interpreting.
+--------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{"hobbies":["Swimming"],"names":["Steve"]}, {"hobbies":["Books"],"names":["sam","Steve"]}, {"hobbies":["Music","Gym"],"names":["sam","Alex"]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+
Formatted output
[
{"hobbies": ["Swimming"],"names": ["Steve"]},
{"hobbies": ["Books"],"names": ["sam","Steve"]},
{"hobbies": ["Music","Gym"],"names": ["sam","Alex"]}
]
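To see why grouping twice produces the desired result, here is a minimal sketch of the same two steps on plain Scala collections, without Spark. The names Person, Group and HobbyGroups are hypothetical and only serve the illustration. Unlike the Spark version, this sketch sorts the names and hobbies so that the result is deterministic, which means the order inside each list can differ from the output shown above.

```scala
// Hypothetical types for illustration only; they are not part of the Spark answer.
final case class Person(name: String, hobbies: List[String])
final case class Group(names: List[String], hobbies: List[String])

object HobbyGroups {
  def group(people: List[Person]): List[Group] = {
    // "Explode": one (hobby, name) pair per hobby of each person.
    val pairs: List[(String, String)] =
      for (p <- people; h <- p.hobbies) yield (h, p.name)

    // First grouping: for each hobby, the sorted list of names that have it.
    // Sorting makes the name list a stable key for the second grouping.
    val namesByHobby: Map[String, List[String]] =
      pairs.groupBy(_._1).map { case (hobby, hs) =>
        hobby -> hs.map(_._2).distinct.sorted
      }

    // Second grouping: hobbies shared by exactly the same set of names
    // end up in the same group.
    namesByHobby.toList
      .groupBy(_._2)
      .map { case (names, hs) => Group(names, hs.map(_._1).sorted) }
      .toList
      .sortBy(_.hobbies.head) // stable order of the output rows
  }
}
```

With the three people from the question, HobbyGroups.group returns Group(List(Steve, sam), List(Books)), Group(List(Alex, sam), List(Gym, Music)) and Group(List(Steve), List(Swimming)). Note that the default String ordering puts uppercase letters before lowercase ones, which is why "Steve" sorts before "sam".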