How to efficiently assign values to DataFrame cells while iterating over another object

I have a generator object that essentially consists of nested lists. It contains roughly 20,000 lists and is structured like this:

cases = [[0,36,12],[64,28,1],....

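Each list in this object holds the row indices that belong to one process. I now want to assign a ProcessID to the corresponding rows of the DataFrame. Currently I do this with a for loop: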

moc = df.iloc
processID = 0
for process in cases:
  for step in process:
    # index by the single row (step), not by the whole list (process)
    moc[step,-1] = processID
  processID += 1

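Although this works, running through the for loops takes a long time, so I am interested in a more efficient way to assign the processID.

Because I need to iterate over the cases object, and because the nested lists differ in length, I don't see how to use a more efficient approach such as df.apply() or np.where().

Any help is appreciated.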

Example:

import pandas as pd
import numpy as np

cases = [[1,4,2],[3,5,0],[9,6],[7,8]]


d = {'col1': ["some_information", "some_information","some_information",
              "some_information","some_information","some_information", 
              "some_information","some_information","some_information",
              "some_information"],
    'processID':np.empty}

df = pd.DataFrame(data=d)

print(df)
               col1                  processID
0  some_information  <built-in function empty>
1  some_information  <built-in function empty>
2  some_information  <built-in function empty>
3  some_information  <built-in function empty>
4  some_information  <built-in function empty>
5  some_information  <built-in function empty>
6  some_information  <built-in function empty>
7  some_information  <built-in function empty>
8  some_information  <built-in function empty>
9  some_information  <built-in function empty>

moc = df.iloc
processID = 1
for case in cases:
    for idx in case:
        moc[idx,-1] = processID

    processID += 1


print(df)
               col1 processID
0  some_information         2
1  some_information         1
2  some_information         1
3  some_information         2
4  some_information         1
5  some_information         2
6  some_information         3
7  some_information         4
8  some_information         4
9  some_information         3

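IIUC, here is a solution that uses a dict comprehension together with Index.repeat and numpy.hstack to build a pandas.Series, which you can then use to update the DataFrame. The benefit is that there is no row-by-row loop over the DataFrame: the only Python-level iteration is one pass over cases, and the assignment itself is a single vectorized operation.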

s = pd.Series({(i+1):x for i, x in enumerate(cases)})
processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s))

With your example cases, this creates a Series named processes whose index is the row label and whose values are the process IDs:

1    1
4    1
2    1
3    2
5    2
0    2
9    3
6    3
7    4
8    4
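To make the two lines above easier to follow, here is a minimal sketch that prints the intermediate pieces for the example cases. The variable names lengths, row_labels and process_ids are introduced here for illustration only, and the lists are passed to numpy.hstack via s.tolist() rather than the Series itself, which should be equivalent.

```python
import numpy as np
import pandas as pd

cases = [[1, 4, 2], [3, 5, 0], [9, 6], [7, 8]]

# One entry per process; the index holds the 1-based process ID.
s = pd.Series({(i + 1): x for i, x in enumerate(cases)})

# Number of rows in each process (the .str accessor also works on lists).
lengths = s.str.len().to_numpy()

# All row labels flattened into one array, in the order they appear in cases.
row_labels = np.hstack(s.tolist())

# Each process ID repeated once per row that belongs to it.
process_ids = s.index.repeat(lengths)

print(lengths.tolist())      # [3, 3, 2, 2]
print(row_labels.tolist())   # [1, 4, 2, 3, 5, 0, 9, 6, 7, 8]
print(process_ids.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

Pairing process_ids (values) with row_labels (index) gives exactly the Series shown above.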

You can then assign it to your DataFrame:

df['processID'] = processes
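The assignment works because pandas aligns the Series with the DataFrame on index labels, not on position. The following sketch checks the result against the nested loop from the question. It assumes the DataFrame has the default RangeIndex, so that labels and positions coincide; the name expected is introduced only for this check.

```python
import numpy as np
import pandas as pd

cases = [[1, 4, 2], [3, 5, 0], [9, 6], [7, 8]]
df = pd.DataFrame({"col1": ["some_information"] * 10})

s = pd.Series({(i + 1): x for i, x in enumerate(cases)})
processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s.tolist()))

# Aligned on index labels: row 1 gets the value stored under label 1, etc.
df["processID"] = processes

# Reference result computed with the nested loop from the question.
expected = [None] * len(df)
for process_id, case in enumerate(cases, start=1):
    for idx in case:
        expected[idx] = process_id

print(df["processID"].tolist())  # [2, 1, 1, 2, 1, 2, 3, 4, 4, 3]
print(df["processID"].tolist() == expected)  # True
```

Note that any row that does not appear in cases would receive NaN, which also turns the column into a float column.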

Testing performance

Setup: create a DataFrame of length 100,000 and a random cases list:

idx = pd.Series(np.arange(100000)).sample(frac=1).values.tolist()
cases = [idx[i:i + 3] for i in range(0, len(idx), 3)]

df=pd.DataFrame({'col1':np.arange(100000),
                 'col2':['some_data']*100000})

Timings

%%timeit

s = pd.Series({(i+1):x for i, x in enumerate(cases)}).to_frame()
processes = pd.Series(s.index.repeat(s[0].str.len()), index=np.hstack(s[0]))
df['processID'] = processes

92.2 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
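As a possible alternative (not part of the answer above and not benchmarked here), the same idea can be written with NumPy only, using numpy.repeat and numpy.concatenate and writing by position as the iloc loop in the question does. This is a sketch under two assumptions: cases has been materialized into a list first (a generator can only be consumed once and has no len), and -1 is an arbitrary marker chosen here for rows that belong to no process.

```python
import numpy as np
import pandas as pd

cases = [[1, 4, 2], [3, 5, 0], [9, 6], [7, 8]]
cases = list(cases)  # needed if cases is a generator, as in the question
df = pd.DataFrame({"col1": ["some_information"] * 10})

# Length of each nested list.
lengths = np.fromiter((len(c) for c in cases), dtype=np.intp, count=len(cases))

# Flattened row positions and the matching 1-based process ID for each one.
rows = np.concatenate(cases)
ids = np.repeat(np.arange(1, len(cases) + 1), lengths)

# Fill a plain array by position, then attach it as a column in one step.
process_id = np.full(len(df), -1, dtype=np.int64)  # -1 = not in any case
process_id[rows] = ids
df["processID"] = process_id

print(df["processID"].tolist())  # [2, 1, 1, 2, 1, 2, 3, 4, 4, 3]
```

Because this variant indexes by position, it does not depend on the DataFrame's index labels, whereas the Series-based version aligns on labels.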