Faster way to loop in Python when updating values from an array

I have a DataFrame, test, as shown below:

 Student_Id  Math  Physical  Arts Class Sub_Class
0        id_1     6         7     9     A         x
1        id_2     9         7     1     A         y
2        id_3     3         5     5     C         x
3        id_4     6         8     9     A         x
4        id_5     6         7    10     B         z
5        id_6     9         5    10     B         z
6        id_7     3         5     6     C         x
7        id_8     3         4     6     C         x
8        id_9     6         8     9     A         x
9       id_10     6         7    10     B         z
10      id_11     9         5    10     B         z
11      id_12     3         5     6     C         x

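My code (listed further down) defines two arrays: arr_list and array_top.

I want to create a new column by looping over every row of the DataFrame and filling in a value taken from the arrays, like this: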

for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]

For larger datasets this loop takes far too long. Is there a faster way?

My code:

import pandas as pd
import numpy as np

# Create the DataFrame
data = [
    ["id_1",6,7,9, "A", "x"],
    ["id_2",9,7,1, "A","y" ],
    ["id_3",3,5,5, "C", "x"],
    ["id_4",6,8,9, "A","x" ],
    ["id_5",6,7,10, "B", "z"],
    ["id_6",9,5,10,"B", "z"],
    ["id_7",3,5,6, "C", "x"],
    ["id_8",3,4,6, "C", "x"],
    ["id_9",6,8,9, "A","x" ],
    ["id_10",6,7,10, "B", "z"],
    ["id_11",9,5,10,"B", "z"],
    ["id_12",3,5,6, "C", "x"]
    
]

test = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])


# Create two arrays with the same length as the test data
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])

array_top = np.array([[0],[1],[1],[2],[1], [0], [0],[1],[1],[2],[1], [0]])

# Create the column Highest_Score
for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]

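Looping over the arrays first to build the new column, and then assigning it to the DataFrame in one step, is much faster than looping over each row of the DataFrame.

71.7 µs vs 2.77 ms (i.e. roughly 39 times faster) in my timing runs: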

In [95]: %%timeit
    ...: new_test['Highest_Score'] = [arr_list[r][c][0] for r,c in enumerate(array_top)]
    ...:
    ...:
71.7 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [96]: %%timeit
    ...: for index, row in test.iterrows():
    ...:       test.loc[index,'Highest_Score'] = arr_list [index][array_top [index]]
    ...:
2.77 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
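
For reference, here is a minimal self-contained sketch of the list-comprehension approach timed above. The four-row arrays are shortened stand-ins for the question's data (an assumption for brevity), and the result is assigned to test rather than to the new_test used in the timing run, which is presumably just a copy of test.

```python
import numpy as np
import pandas as pd

# Shortened stand-in data (hypothetical; same layout as the question's arrays).
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [1], [2]])
test = pd.DataFrame({"Student_Id": ["id_1", "id_2", "id_3", "id_4"]})

# Build the whole column as a plain Python list first.
# enumerate yields (row number, 1-element array holding the column index),
# so arr_list[r][c] is a 1-element array and [0] unwraps it to a scalar.
highest = [arr_list[r][c][0] for r, c in enumerate(array_top)]

# Then assign it to the DataFrame in a single operation.
test["Highest_Score"] = highest
```

The trailing [0] matters: without it each cell would hold a one-element array instead of a number.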

As a general rule when adding new data to a pandas DataFrame, do all the looping and computation outside pandas, then assign all the data at once.
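
Since both inputs are already NumPy arrays, the Python-level loop can probably be dropped altogether. The sketch below uses np.take_along_axis, which is not part of the original answer and was not included in the timings; the shortened arrays are again illustrative stand-ins, so treat this as one possible variant rather than a measured improvement.

```python
import numpy as np
import pandas as pd

# Shortened stand-in data (hypothetical; same layout as the question's arrays).
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [1], [2]])
test = pd.DataFrame({"Student_Id": ["id_1", "id_2", "id_3", "id_4"]})

# For each row i, pick arr_list[i, array_top[i, 0]] without a Python loop.
# array_top has shape (n, 1), so the result also has shape (n, 1).
picked = np.take_along_axis(arr_list, array_top, axis=1)

# Flatten to one value per row and assign the column in one step.
test["Highest_Score"] = picked.ravel()
```

This relies on array_top having one column index per row, as it does in the question.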