Apache Spark Python GroupByKey 或 reduceByKey 或 combineByKey
Apache Spark Python GroupByKey or reduceByKey or combineByKey
我正在尝试处理一个 3 GB file.The 文件结构,它包含多行并且一组 n 行可以按特定的键分组,每个键出现在特定的位置
这是示例文件结构
abc123Key1asdas
abc124Key1asdas
abc126Key1asasd
abcw23Key2asdad
asdfsaKey2asdsa
....
.....
.....
abcasdKeynasdas
asfssdfKeynasda
asdaasdKeynsdfa
我要实现的结构是
((Key1,(abc123Key1asdas,abc124Key1asdas,abc126Key1asasd)),(Key2,(abcw23Key2asdad,asdfsaKey2asdsa)),...(Keyn,(abcasdKeynasdas,asfssdfKeynasda,asdaasdKeynsdfa))
我正在尝试做这样的事情
lines = sc.textFile(fileName)
counts = lines.flatMap(lambda line: line.split('\n')).map(lambda line: (line[10:21],line))
output = counts.combineByKey().collect()
任何人都可以帮助我实现我想要做的事情吗?
只需将 combineByKey() 替换为 groupByKey() 即可。
示例代码
data = sc.parallelize(['abc123Key1asdas','abc123Key1asdas','abc123Key1asdas', 'abcw23Key2asdad', 'abcw23Key2asdad', 'abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])
data.map(lambda line: (line[6:10],line)).groupByKey().mapValues(list).collect()
[('Key1', ['abc123Key1asdas', 'abc123Key1asdas', 'abc123Key1asdas']), ('Key2', ['abcw23Key2asdad', 'abcw23Key2asdad']), ('Keyn', ['abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])]
我正在尝试处理一个 3 GB file.The 文件结构,它包含多行并且一组 n 行可以按特定的键分组,每个键出现在特定的位置
这是示例文件结构
abc123Key1asdas
abc124Key1asdas
abc126Key1asasd
abcw23Key2asdad
asdfsaKey2asdsa
....
.....
.....
abcasdKeynasdas
asfssdfKeynasda
asdaasdKeynsdfa
我要实现的结构是
((Key1,(abc123Key1asdas,abc124Key1asdas,abc126Key1asasd)),(Key2,(abcw23Key2asdad,asdfsaKey2asdsa)),...(Keyn,(abcasdKeynasdas,asfssdfKeynasda,asdaasdKeynsdfa))
我正在尝试做这样的事情
lines = sc.textFile(fileName)
counts = lines.flatMap(lambda line: line.split('\n')).map(lambda line: (line[10:21],line))
output = counts.combineByKey().collect()
任何人都可以帮助我实现我想要做的事情吗?
只需将 combineByKey() 替换为 groupByKey() 即可。
示例代码
data = sc.parallelize(['abc123Key1asdas','abc123Key1asdas','abc123Key1asdas', 'abcw23Key2asdad', 'abcw23Key2asdad', 'abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])
data.map(lambda line: (line[6:10],line)).groupByKey().mapValues(list).collect()
[('Key1', ['abc123Key1asdas', 'abc123Key1asdas', 'abc123Key1asdas']), ('Key2', ['abcw23Key2asdad', 'abcw23Key2asdad']), ('Keyn', ['abcasdKeynasdas', 'asfssdKeynasda', 'asdaasKeynsdfa'])]