pandas: better runtime when going through a dataframe

I have a pandas dataframe. I want to search one column for a number, find it, and put it in a new column.

import pandas as pd
import regex as re  # third-party 'regex' package; the standard-library 're' also works here
import numpy as np

data = {'numbers':['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate':[434, 456, 454256, 2334544]}

df = pd.DataFrame(data)
print(df)

pattern = '134.[A-Z]{2,}'

# Placeholder column that will hold the matches for each row
df['mynumbers'] = None

# Integer positions of the two columns, needed by .iat
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')

length = np.array([])  # not used in this snippet

# Loop over every row and store the list of matches
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number

print(df)


I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is big. Is there a better, faster way in pandas to go through my df?

Sure, use Series.str.findall instead of the loop:

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
                            numbers     rate            mynumbers
0        134.ABBC,189.DREB, 134.TEB      434  [134.ABBC, 134.TEB]
1    256.EHBE, 134.RHECB, 345.DREBE      456          [134.RHECB]
2      456.RHN,256.REBN,864.TREBNSE   454256                   []
3  256.DREB, 134.ETNHR,245.DEBHTECM  2334544          [134.ETNHR]
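One detail that may matter on real data: the `.` in the pattern is unescaped, so it matches any character, not only a literal dot. On the sample data this makes no difference. The sketch below assumes you want a literal dot; the second row is made-up input (not from the question) to show where the two patterns diverge, and the names `loose` and `strict` are only for illustration.

```python
import pandas as pd

# First value is from the question; the second is a hypothetical value
# added only to show the difference between the two patterns.
df = pd.DataFrame({'numbers': ['134.ABBC,189.DREB, 134.TEB',
                               '1345ABBC, 256.EHBE']})

loose = r'134.[A-Z]{2,}'    # '.' matches any single character
strict = r'134\.[A-Z]{2,}'  # '\.' matches only a literal dot

loose_hits = df['numbers'].str.findall(loose)
strict_hits = df['numbers'].str.findall(strict)

print(loose_hits.tolist())
print(strict_hits.tolist())
```

If values like `1345ABBC` cannot occur in your column, the original pattern is fine as it is.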

If you want to use re.findall, that is possible too; it is only about 2 times slower:

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))

# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)

pattern = '134.[A-Z]{2,}'

In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
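The `%timeit` lines above only work in IPython/Jupyter. If you want to repeat the comparison in a plain script, a minimal sketch using the standard-library `timeit` module could look like this. It uses the standard-library `re` rather than `regex`, the variable names are my own, and the absolute numbers will depend on your machine and pandas version.

```python
import re
import timeit

import pandas as pd

data = {'numbers': ['134.ABBC,189.DREB, 134.TEB',
                    '256.EHBE, 134.RHECB, 345.DREBE',
                    '456.RHN,256.REBN,864.TREBNSE',
                    '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate': [434, 456, 454256, 2334544]}

# Repeat the 4 sample rows to get 40000 rows, as in the timing above
df = pd.concat([pd.DataFrame(data)] * 10000, ignore_index=True)

pattern = '134.[A-Z]{2,}'

# Both approaches should give the same list of matches per row
via_map = df['numbers'].map(lambda x: re.findall(pattern, x))
via_str = df['numbers'].str.findall(pattern)
same = via_map.tolist() == via_str.tolist()

# Total seconds for 3 runs of each approach
t_map = timeit.timeit(lambda: df['numbers'].map(lambda x: re.findall(pattern, x)), number=3)
t_str = timeit.timeit(lambda: df['numbers'].str.findall(pattern), number=3)

print('identical results:', same)
print(f'map + re.findall: {t_map:.3f} s')
print(f'str.findall:      {t_str:.3f} s')
```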