How to assign values to cells of a DataFrame efficiently while iterating over another object
I have a generator object that essentially consists of nested lists.
It contains roughly 20,000 lists and is structured like this:
cases = [[0,36,12],[64,28,1],....
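Each list in this object holds the rows that belong to one process. I now want to assign a process ID to the corresponding rows of the DataFrame. At the moment I do this with nested for loops: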
moc = df.iloc
processID = 0
for process in cases:
    for step in process:
        # write the current ID into the last column of this row
        moc[step, -1] = processID
    processID += 1
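Although this works, running the for loops takes a long time, so I am interested in a more efficient way to assign the processID.
Because I need to iterate over the cases object, and because the nested lists differ in length, I don't see how to apply a more efficient approach such as df.apply() or np.where().
Any help is appreciated.
Example: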
import pandas as pd
import numpy as np
cases = [[1,4,2],[3,5,0],[9,6],[7,8]]
d = {'col1': ["some_information", "some_information", "some_information",
              "some_information", "some_information", "some_information",
              "some_information", "some_information", "some_information",
              "some_information"],
     'processID': np.empty}
df = pd.DataFrame(data=d)
print(df)
               col1                  processID
0  some_information  <built-in function empty>
1  some_information  <built-in function empty>
2  some_information  <built-in function empty>
3  some_information  <built-in function empty>
4  some_information  <built-in function empty>
5  some_information  <built-in function empty>
6  some_information  <built-in function empty>
7  some_information  <built-in function empty>
8  some_information  <built-in function empty>
9  some_information  <built-in function empty>
moc = df.iloc
processID = 1
for case in cases:
    for idx in case:
        moc[idx, -1] = processID
    processID += 1
print(df)
               col1  processID
0  some_information          2
1  some_information          1
2  some_information          1
3  some_information          2
4  some_information          1
5  some_information          2
6  some_information          3
7  some_information          4
8  some_information          4
9  some_information          3
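IIUC, here is a solution that uses a dict comprehension, Index.repeat and numpy.hstack to build a pandas.Series that you can use to update the DataFrame. The benefit is that it avoids the nested for loops with row-by-row assignment.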
s = pd.Series({(i+1):x for i, x in enumerate(cases)})
processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s))
With your example cases, this creates a Series named processes that looks like this (row label on the left, process ID on the right):
1 1
4 1
2 1
3 2
5 2
0 2
9 3
6 3
7 4
8 4
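You can then assign it to your DataFrame. The assignment aligns on the index, so the order of the rows in processes does not matter: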
df['processID'] = processes
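To check the approach end to end, here is a minimal, self-contained sketch that runs the same steps on the sample data from the question and prints the resulting column. The simplified construction of df (only col1) is an assumption made to keep the sketch short; the three core lines are the ones shown above.

```python
import numpy as np
import pandas as pd

cases = [[1, 4, 2], [3, 5, 0], [9, 6], [7, 8]]
# Simplified stand-in for the question's DataFrame (assumption: only col1 is needed here).
df = pd.DataFrame({"col1": ["some_information"] * 10})

# One entry per process: index = process ID (starting at 1), value = list of row labels.
s = pd.Series({i + 1: rows for i, rows in enumerate(cases)})

# Repeat each process ID once per row it owns and index the result by those row labels.
processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s))

# Column assignment aligns `processes` with df's index.
df["processID"] = processes

print(df["processID"].tolist())
```

Rows that appear in no case would end up as NaN after the assignment, because index alignment leaves unmatched labels empty.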
Performance test
Setup: create a DataFrame of length 100,000 and a random cases list:
idx = pd.Series(np.arange(100000)).sample(frac=1).values.tolist()
cases = [idx[i:i + 3] for i in range(0, len(idx), 3)]
df = pd.DataFrame({'col1': np.arange(100000),
                   'col2': ['some_data'] * 100000})
Timings
%%timeit
s = pd.Series({(i+1):x for i, x in enumerate(cases)}).to_frame()
processes = pd.Series(s.index.repeat(s[0].str.len()), index=np.hstack(s[0]))
df['processID'] = processes
92.2 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
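As a possible variation (not part of the answer above and not timed here), the same mapping can be built with plain NumPy arrays and written by position. This is only a sketch under two assumptions: the values in cases are row positions, as in the question's use of df.iloc, and -1 is an acceptable placeholder for rows that belong to no process. If cases is a generator, it has to be materialised with list(cases) first.

```python
import numpy as np
import pandas as pd

cases = [[1, 4, 2], [3, 5, 0], [9, 6], [7, 8]]
df = pd.DataFrame({"col1": ["some_information"] * 10})

# Number of rows per process, and all row positions flattened into one array.
lengths = np.array([len(rows) for rows in cases])
row_positions = np.concatenate(cases)

# Process IDs 1..n, each repeated once per row of that process.
ids = np.repeat(np.arange(1, len(cases) + 1), lengths)

# Assumption: -1 marks rows that are not listed in any case.
process_id = np.full(len(df), -1, dtype=np.int64)
process_id[row_positions] = ids

df["processID"] = process_id
print(df["processID"].tolist())
```

This skips building an intermediate Series of lists, but it relies on positions rather than index labels, so it only matches the Series-based version when df has a default RangeIndex.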