groupby count in pyspark data frame

Question

My data frame looks like this:

id      age      gender       category
1        34        m             b
1        34        m             c
1        34        m             b
2        28        f             a
2        28        f             b
3        23        f             c
3        23        f             c 
3        23        f             c

I want my data frame to look like this:

id      age      gender       a      b      c
1        34        m          0      2      1
2        28        f          1      1      0
3        23        f          0      0      2

This is what I have tried:

from pyspark.sql import functions as F
df = df.groupby(['id','age','gender']).pivot('category').agg(F.count('category')).fillna(0)
df.show()

How can I do this in PySpark? What is the correct way to handle it?

Answer 1

Your code looks fine to me, but when I ran it on the sample data, I got this:

from pyspark.sql.functions import count

df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt',header=True,inferSchema=True,sep="\t")
df = df.groupby(['id','age','gender']).pivot('category').agg(count('category')).fillna(0)
df.show()

df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 5 more fields]
+---+---+------+---+---+---+---+
| id|age|gender|  a|  b|  c| c |
+---+---+------+---+---+---+---+
|  2| 28|     f|  1|  1|  0|  0|
|  1| 34|     m|  0|  2|  1|  0|
|  3| 23|     f|  0|  0|  1|  2|
+---+---+------+---+---+---+---+

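This happens because the c in the last two rows is followed by an extra space character, so the pivot treats "c " as a separate category from "c".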
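
To see why a stray space produces an extra column, here is a minimal plain-Python sketch (no Spark required) that counts categories per id roughly the way the pivot does. The rows mirror the question's data; which of the id 3 rows carry the trailing space is an assumption inferred from the output above.

```python
from collections import Counter

# Rows mirror the question's data as (id, category). The trailing space on
# two of the id 3 rows is an assumption inferred from the output above.
rows = [
    (1, "b"), (1, "c"), (1, "b"),
    (2, "a"), (2, "b"),
    (3, "c"), (3, "c "), (3, "c "),
]

# Counting on the raw strings treats "c" and "c " as different categories.
raw_counts = Counter(rows)

# Stripping trailing whitespace first merges them into one category.
clean_counts = Counter((row_id, category.rstrip()) for row_id, category in rows)

print(sorted({category for _, category in raw_counts}))    # ['a', 'b', 'c', 'c ']
print(sorted({category for _, category in clean_counts}))  # ['a', 'b', 'c']
print(clean_counts[(3, "c")])                              # 3
```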

Just trim the trailing space using rtrim():

from pyspark.sql.functions import count, rtrim

df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt',header=True,inferSchema=True,sep='\t')
# Strip trailing whitespace from category before pivoting
df = df.withColumn('category',rtrim(df['category']))
df = df.groupby(['id','age','gender']).pivot('category').agg(count('category')).fillna(0)
df.show()

df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 4 more fields]
+---+---+------+---+---+---+
| id|age|gender|  a|  b|  c|
+---+---+------+---+---+---+
|  2| 28|     f|  1|  1|  0|
|  1| 34|     m|  0|  2|  1|
|  3| 23|     f|  0|  0|  3|
+---+---+------+---+---+---+
