Pandas: A more efficient way to check column startswith and assign new column?
I have a dataframe with 500k rows that looks like this:
status_code
------------
202
302
403
500
202
.
.
.
------------
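I created a new column 'status_code_grp'. For each row, I check whether the status code starts with "2" and, if so, assign "status_code_grp" = "200". I repeat this for the groups "300", "400", and "500".
I wrote something like this: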
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(200, 599, size=(500000, 1)), columns=['status_code'])
df2['status_code_grp'] = ''  # create the column before filling it row by row
for eachRow in range(len(df2)):
    # status_code holds integers, so convert to str before calling startswith
    code = str(df2.at[eachRow, 'status_code'])
    if code.startswith('2'):
        df2.at[eachRow, 'status_code_grp'] = "2xx"
    elif code.startswith('3'):
        df2.at[eachRow, 'status_code_grp'] = "3xx"
    elif code.startswith('4'):
        df2.at[eachRow, 'status_code_grp'] = "4xx"
    elif code.startswith('5'):
        df2.at[eachRow, 'status_code_grp'] = "5xx"
The for loop takes a very long time to finish. Is there a more efficient way to do this than checking row by row as in the code above?
Use integer (floor) division by 100, then multiply by 100:
df2['status_code_grp'] = df2['status_code'] // 100 * 100
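To see why this works: floor division by 100 drops the last two digits, and multiplying by 100 restores the magnitude, so 202 becomes 200 and 403 becomes 400. A minimal sketch on the sample values from the question (the small `df_demo` frame is made up for illustration and is not part of the original answer):

```python
import pandas as pd

# Hypothetical small frame holding the sample codes shown in the question
df_demo = pd.DataFrame({'status_code': [202, 302, 403, 500, 202]})

# 202 // 100 == 2 and 2 * 100 == 200, so each code maps to its hundreds bucket
df_demo['status_code_grp'] = df_demo['status_code'] // 100 * 100

print(df_demo['status_code_grp'].tolist())  # [200, 300, 400, 500, 200]
```

The whole column is computed in one vectorized operation, so no Python-level loop over rows is needed.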
It is faster in numpy if you first convert the Series to an array with Series.to_numpy:
df2 = pd.DataFrame(np.random.randint(200,599,size=(500000, 1)), columns=['status_code'])
In [381]: %timeit df2['status_code_grp1'] = df2['status_code'] // 100 * 100
12.5 ms ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [382]: %timeit df2['status_code_grp2'] = df2['status_code'].to_numpy() // 100 * 100
6.62 ms ± 42.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
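Note that the loop in the question writes string labels such as "2xx", while the floor-division approach produces integers such as 200. If the string labels are what is needed, one possible vectorized sketch (an assumption about the desired output, not part of the original answer; `df_demo` and `hundreds` are illustrative names) builds them from the leading digit:

```python
import pandas as pd

# Hypothetical small frame holding the sample codes shown in the question
df_demo = pd.DataFrame({'status_code': [202, 302, 403, 500, 202]})

# Leading digit of each three-digit code, e.g. 403 -> 4
hundreds = df_demo['status_code'] // 100

# Convert the digit to a string and append 'xx', e.g. 4 -> '4xx'
df_demo['status_code_grp'] = hundreds.astype(str) + 'xx'

print(df_demo['status_code_grp'].tolist())  # ['2xx', '3xx', '4xx', '5xx', '2xx']
```

The string conversion adds some cost compared with the pure integer arithmetic, so the timings above would not carry over directly to this variant.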