How to sum the values by key in PySpark SQL or MySQL
I don't know how to add values by index. Help me solve this problem: for each key, sum the values position by position (by index).
Input CSV:
Country,Values
Canada,47;97;33;94;6
Canada,59;98;24;83;3
Canada,77;63;93;86;62
China,86;71;72;23;27
China,74;69;72;93;7
China,58;99;90;93;41
England,40;13;85;75;90
England,39;13;33;29;14
England,99;88;57;69;49
Germany,67;93;90;57;3
Germany,0;9;15;20;19
Germany,77;64;46;95;48
India,90;49;91;14;70
India,70;83;38;27;16
India,86;21;19;59;4
The output CSV should be:
Country,Values
Canada,183;258;150;263;71
China,218;239;234;209;75
England,178;114;175;173;153
Germany,144;166;151;172;70
India,246;153;148;100;90
Import required modules & create session
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
Read csv file & create view
file_df = spark_session.read.csv("/data/user/country_values.csv", header=True)
file_df.createOrReplaceTempView("temp_view")
Write SQL query
output_df = spark_session.sql("""
    SELECT Country
         , CONCAT(CAST(SUM(split(Values, ';')[0]) AS INT), ';'
                , CAST(SUM(split(Values, ';')[1]) AS INT), ';'
                , CAST(SUM(split(Values, ';')[2]) AS INT), ';'
                , CAST(SUM(split(Values, ';')[3]) AS INT), ';'
                , CAST(SUM(split(Values, ';')[4]) AS INT)) AS Values
    FROM temp_view
    GROUP BY Country
    ORDER BY Country
""")
Print result
output_df.show()
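If you need the result back on disk as a CSV, here is a minimal sketch; the output directory is hypothetical, and note that Spark writes a directory of part files, not a single file:

# coalesce(1) forces a single part file inside the output directory
output_df.coalesce(1).write.csv("/data/user/country_sums", header=True, mode="overwrite")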
Limitation -> the query is written on the assumption that the Values column yields exactly 5 elements after splitting on ";". The Values column is also split 5 times within the query, which can be optimized.
Alternatively, use a CTE to split the values into an integer array once, before the main aggregation:
spark_session.sql("""
WITH t1 AS (
SELECT Country
, CAST(split(Values, ";") AS array<int>) AS V
FROM temp_view
)
SELECT Country
, CONCAT_WS(";", sum(V[0]), sum(V[1]), sum(V[2]), sum(V[3]), sum(V[4])) AS Values
FROM t1
GROUP BY Country
ORDER BY Country
""").show(truncate=False)