Most efficient way to combine large Pandas DataFrames based on multiple column values
I'm working with information in several Pandas DataFrames, each with 10,000+ rows.
I have...
df1, student information
Class Number Student ID
0 13530159 201733468
1 13530159 201736271
2 13530159 201833263
3 13530159 201931506
4 13530159 201933329
...
df2, student responses
title time stu_id score
0 Unit 12 - Reading Homework 10/30/2020 22:06:53 202031164 100
1 Unit 10 - Vocabulary Homework 11/1/2020 21:07:44 202031674 100
2 Unit 10 - Vocabulary Homework 11/3/2020 17:20:55 202032311 100
3 Unit 12 - Reading Homework 11/6/2020 6:04:37 202031164 95
4 Unit 12 - Reading Homework 11/7/2020 5:49:15 202031164 90
...
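I want...
A DataFrame with columns for the class number, the student ID, and each unique assignment title. Each assignment column should contain the student's highest score for that assignment. There can be 20+ assignments/columns. A student can have many different scores for a single assignment; I only want the highest. I would also like to omit scores submitted after a specific date.
df3, highest student scores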
Class Number Student ID Unit 12 - Reading Homework Unit 10 - Vocabulary Homework ...
0 13530159 201733468 100 85 ...
1 13530159 201736271 95 70 ...
2 13530159 201833263 75 65 ...
3 13530159 201931506 80 85 ...
4 13530159 201933329 65 75 ...
...
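What is the most efficient way to do this? I'll be doing it dozens of times.
PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but that is time-consuming. I'm hoping there is an easier, faster way.
PPS, I've read similar questions: Pandas: efficient way to combine dataframes, Pandas apply a function of multiple columns, row-wise, Conditionally fill column values based on another columns value in pandas, etc. None of them specifically addresses my situation.
Of course I don't have your data, so I had to "fake" some, but this should work: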
import random

import pandas

# Student info: 2,000 students, each assigned to a random class
df_1 = pandas.DataFrame(
    [
        {"Class Number": random.randint(13530159, 13530259), "Student ID": student_id}
        for student_id in range(201733468, 201735468)
    ]
)

# Student responses: 10,000 random submissions
df_2 = pandas.DataFrame(
    [
        {
            "title": f"Unit {random.randint(1, 10)} - ...",
            "time": pandas.Timestamp(random.randint(1577870112, 1606814112), unit="s"),
            "stu_id": random.randint(201733468, 201735468),
            "score": random.randint(10, 100),
        }
        for _ in range(10000)
    ]
)

# Merge the two dataframes together (inner join on the student ID)
df = df_1.merge(df_2, left_on="Student ID", right_on="stu_id")

# Create a pivot table, using "max" as the aggregation function
# (the string "max" avoids the FutureWarning that newer pandas raises for numpy.max)
result = pandas.pivot_table(
    df,
    index=["Class Number", "Student ID"],
    columns="title",
    values="score",
    aggfunc="max",
).reset_index()
Output:
title Class Number Student ID Unit 1 - ... Unit 10 - ... Unit 2 - ... \
0 13530159 201733485 NaN NaN NaN
1 13530159 201733705 NaN NaN 16.0
2 13530159 201734020 NaN 92.0 67.0
3 13530159 201734028 100.0 42.0 NaN
4 13530159 201734218 NaN 50.0 41.0
... ... ... ... ... ...
1989 13530259 201734501 NaN 19.0 32.0
1990 13530259 201734760 NaN NaN NaN
1991 13530259 201734954 NaN NaN NaN
1992 13530259 201735137 NaN NaN 83.0
1993 13530259 201735266 NaN 26.0 NaN
title Unit 3 - ... Unit 4 - ... Unit 5 - ... Unit 6 - ... \
0 45.0 NaN NaN 39.0
1 46.0 NaN NaN NaN
2 NaN 89.0 88.0 NaN
3 NaN NaN NaN NaN
4 100.0 NaN NaN 88.0
... ... ... ... ...
1989 NaN NaN 48.0 NaN
1990 33.0 NaN NaN NaN
1991 NaN NaN NaN 74.0
1992 NaN NaN NaN 13.0
1993 35.0 62.0 NaN 43.0
title Unit 7 - ... Unit 8 - ... Unit 9 - ...
0 NaN 65.0 65.0
1 NaN NaN NaN
2 90.0 NaN 88.0
3 NaN 16.0 92.0
4 NaN 77.0 NaN
... ... ... ...
1989 35.0 94.0 NaN
1990 34.0 NaN 45.0
1991 NaN 21.0 19.0
1992 NaN 99.0 60.0
1993 83.0 51.0 NaN
[1994 rows x 12 columns]
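Note: the output contains a lot of NaN values, but that is because the data is randomly generated, so not every student has a result for every assignment. Where a student has no result for an assignment, the value is NaN.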
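The question also asks to omit scores submitted after a specific date, which the code above does not do. A minimal sketch of one way to handle it, assuming the `time` column holds strings in the `month/day/year hour:minute:second` form shown in the question; the `cutoff` value and the small sample frames are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data that mirrors the column names in the question.
df1 = pd.DataFrame(
    {
        "Class Number": [13530159, 13530159],
        "Student ID": [201733468, 201736271],
    }
)
df2 = pd.DataFrame(
    {
        "title": [
            "Unit 12 - Reading Homework",
            "Unit 12 - Reading Homework",
            "Unit 12 - Reading Homework",
            "Unit 10 - Vocabulary Homework",
        ],
        "time": [
            "10/30/2020 22:06:53",
            "11/6/2020 6:04:37",
            "11/20/2020 5:49:15",  # after the cutoff, so it is dropped
            "11/1/2020 21:07:44",
        ],
        "stu_id": [201733468, 201733468, 201733468, 201736271],
        "score": [90, 95, 100, 70],
    }
)

# Assumed deadline; replace with the real one.
cutoff = pd.Timestamp("2020-11-15")

# Parse the timestamps once, then keep only submissions made by the cutoff.
parsed = df2.assign(time=pd.to_datetime(df2["time"], format="%m/%d/%Y %H:%M:%S"))
on_time = parsed[parsed["time"] <= cutoff]

# Same merge + pivot as above, applied to the filtered responses.
merged = df1.merge(on_time, left_on="Student ID", right_on="stu_id")
df3 = merged.pivot_table(
    index=["Class Number", "Student ID"],
    columns="title",
    values="score",
    aggfunc="max",
).reset_index()
```

Filtering before the merge keeps the joined frame smaller, which helps when this runs dozens of times. Because the merge is an inner join, students with no on-time submissions do not appear in `df3`; merging with `how="left"` alone would not bring them back, since `pivot_table` skips rows whose `title` is missing.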