Most efficient way to combine large Pandas DataFrames based on multiple column values

Question

I'm working with information in several Pandas DataFrames, each with 10,000+ rows.

I have...

df1, student information

    Class Number Student ID
0   13530159     201733468
1   13530159     201736271
2   13530159     201833263
3   13530159     201931506
4   13530159     201933329
...

df2, student responses

    title                           time                stu_id      score
0   Unit 12 - Reading Homework      10/30/2020 22:06:53 202031164   100
1   Unit 10 - Vocabulary Homework   11/1/2020 21:07:44  202031674   100
2   Unit 10 - Vocabulary Homework   11/3/2020 17:20:55  202032311   100
3   Unit 12 - Reading Homework      11/6/2020 6:04:37   202031164   95
4   Unit 12 - Reading Homework      11/7/2020 5:49:15   202031164   90
...

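I want...

A DataFrame with columns for class number, student ID, and each unique assignment title. Each assignment column should hold the student's highest score for that assignment. There can be 20+ assignments/columns. A student can have many different scores for one assignment; I only want the highest. I also want to omit scores submitted after a particular date.

df3, highest student scores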

    Class Number Student ID  Unit 12 - Reading Homework   Unit 10 - Vocabulary Homework  ...
0   13530159     201733468   100                          85                             ...              
1   13530159     201736271   95                           70                             ...
2   13530159     201833263   75                           65                             ...
3   13530159     201931506   80                           85                             ...
4   13530159     201933329   65                           75                             ...
...

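What is the most efficient way to do this? I will be doing it dozens of times.

PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but that is time-consuming. I'm hoping there is an easier, faster way.

PPS, I have read similar questions: Pandas: efficient way to combine dataframes, Pandas apply a function of multiple columns, row-wise, Conditionally fill column values based on another columns value in pandas, and others. None of them specifically addresses my problem.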

Answer 1

Of course I don't have your data, so I had to "fake" some, but this should work:

import pandas
import random

# Student info: 2,000 students, each assigned to a random class
df_1 = pandas.DataFrame(
    [
        {"Class Number": random.randint(13530159, 13530259), "Student ID": student_id}
        for student_id in range(201733468, 201735468)
    ]
)

# Student responses: 10,000 random submissions
df_2 = pandas.DataFrame(
    [
        {
            "title": f"Unit {random.randint(1, 10)}  - ...",
            "time": pandas.Timestamp(random.randint(1577870112, 1606814112), unit="s"),
            # randint is inclusive at both ends, so stop at the last ID in df_1
            "stu_id": random.randint(201733468, 201735467),
            "score": random.randint(10, 100),
        }
        for _ in range(10000)
    ]
)

# Merge the two dataframes together (inner join on the student ID)
df = df_1.merge(df_2, left_on="Student ID", right_on="stu_id")

# Create a pivot table, using "max" as the aggregation function
# (the string form avoids the FutureWarning that newer pandas raises for numpy.max)
result = pandas.pivot_table(df, index=["Class Number", "Student ID"], columns="title", values="score", aggfunc="max").reset_index()

Output:

title  Class Number  Student ID  Unit 1  - ...  Unit 10  - ...  Unit 2  - ...  \
0          13530159   201733485            NaN             NaN            NaN   
1          13530159   201733705            NaN             NaN           16.0   
2          13530159   201734020            NaN            92.0           67.0   
3          13530159   201734028          100.0            42.0            NaN   
4          13530159   201734218            NaN            50.0           41.0   
...             ...         ...            ...             ...            ...   
1989       13530259   201734501            NaN            19.0           32.0   
1990       13530259   201734760            NaN             NaN            NaN   
1991       13530259   201734954            NaN             NaN            NaN   
1992       13530259   201735137            NaN             NaN           83.0   
1993       13530259   201735266            NaN            26.0            NaN   

title  Unit 3  - ...  Unit 4  - ...  Unit 5  - ...  Unit 6  - ...  \
0               45.0            NaN            NaN           39.0   
1               46.0            NaN            NaN            NaN   
2                NaN           89.0           88.0            NaN   
3                NaN            NaN            NaN            NaN   
4              100.0            NaN            NaN           88.0   
...              ...            ...            ...            ...   
1989             NaN            NaN           48.0            NaN   
1990            33.0            NaN            NaN            NaN   
1991             NaN            NaN            NaN           74.0   
1992             NaN            NaN            NaN           13.0   
1993            35.0           62.0            NaN           43.0   

title  Unit 7  - ...  Unit 8  - ...  Unit 9  - ...  
0                NaN           65.0           65.0  
1                NaN            NaN            NaN  
2               90.0            NaN           88.0  
3                NaN           16.0           92.0  
4                NaN           77.0            NaN  
...              ...            ...            ...  
1989            35.0           94.0            NaN  
1990            34.0            NaN           45.0  
1991             NaN           21.0           19.0  
1992             NaN           99.0           60.0  
1993            83.0           51.0            NaN  

[1994 rows x 12 columns]

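Note: the output contains a lot of NaN values, but that is because I generated the data randomly. Not every student ends up with a result for every assignment, and where a student has no result for an assignment, the value is NaN.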
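The code above does not apply the submission cutoff mentioned in the question, and its default inner merge drops students who have no responses at all. Below is a minimal sketch of one way to handle both, assuming the `time` column holds strings in the `month/day/year hour:minute:second` format shown in the question. The helper name `best_scores`, the cutoff value, and the sample rows are all made up for illustration; they are not from the original data.

```python
import pandas as pd


def best_scores(students, responses, cutoff):
    """Return one row per student with the highest on-time score per assignment."""
    responses = responses.copy()
    # Parse the timestamps; the format is an assumption based on the question's sample rows.
    responses["time"] = pd.to_datetime(responses["time"], format="%m/%d/%Y %H:%M:%S")
    # Keep only submissions made at or before the cutoff.
    on_time = responses[responses["time"] <= pd.Timestamp(cutoff)]
    # Highest score per student and assignment, one column per title.
    best = on_time.pivot_table(
        index="stu_id", columns="title", values="score", aggfunc="max"
    ).reset_index()
    # Left join so students without any on-time submission are kept (as NaN).
    merged = students.merge(best, how="left", left_on="Student ID", right_on="stu_id")
    return merged.drop(columns="stu_id")


# Made-up sample data, loosely modeled on the tables in the question.
students = pd.DataFrame(
    {
        "Class Number": [13530159, 13530159, 13530159],
        "Student ID": [202031164, 202031674, 201733468],
    }
)
responses = pd.DataFrame(
    {
        "title": [
            "Unit 12 - Reading Homework",
            "Unit 10 - Vocabulary Homework",
            "Unit 10 - Vocabulary Homework",
            "Unit 12 - Reading Homework",
            "Unit 12 - Reading Homework",
        ],
        "time": [
            "10/30/2020 22:06:53",
            "11/1/2020 21:07:44",
            "11/3/2020 17:20:55",
            "11/6/2020 6:04:37",
            "11/7/2020 5:49:15",
        ],
        "stu_id": [202031164, 202031674, 202032311, 202031164, 202031164],
        "score": [80, 100, 100, 95, 100],
    }
)

# Hypothetical cutoff: anything submitted after 6 November 2020 is ignored.
result = best_scores(students, responses, cutoff="2020-11-06 23:59:59")
print(result)
```

Filtering and pivoting the responses before the merge keeps the join small (one row per student), which should help when this is repeated dozens of times. If blank cells are preferred over NaN, `result.fillna("")` or a numeric placeholder can be applied at the end.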
