按组插值 - 如何提高性能

Interpolate groupwise - how to improve performance

请考虑以下 df:

import pandas as pd
data = {'year':  [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011],
        'bfsId': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2],
         'income': [15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000],
         'taxes_perc': [0.74, 1.715, 3.192, 4.09, 0.813333333, 1.905, 3.548, 4.543333333, 0.753333333, 1.775, 3.308, 4.183333333, 0.813333333, 1.94, 3.608, 4.563333333],
         'perc_inc': [17375, 23625, 33875, 33875, 17375, 23625, 33875, 33875, 17375, 23500, 33625, 33625, 17375, 23500, 33625, 33625]
        }

df = pd.DataFrame(data)

我想为每一年申请scipy.interpolate.interp1d,并分别申请bfsId。 我想出了一个循环来完成我想要的。 不幸的是,性能似乎很差。 问题是在我的真实数据中我有20多年和2000多个bfsIds(而且我有大约20个数据集)。

这是我的循环:

import scipy.interpolate 
df_interpol = pd.DataFrame()

for j in range(2010, 2012):
    df_jahr = df[(df.year == j)]
    for i in df_jahr.bfsId.unique():
        df_jahr_gem = df_jahr[df_jahr.bfsId == i].copy()
        y = df_jahr_gem.taxes_perc
        x = df_jahr_gem.income
        y_interp = scipy.interpolate.interp1d(x, y, fill_value="extrapolate")
        df_jahr_gem['tax_rate_interpol'] = pd.Series(y_interp(df_jahr_gem.perc_inc)).values
        df_interpol = df_interpol.append(df_jahr_gem)

关于如何重写代码的任何想法(可能使用 groupby 和使用函数,但我无法实现它)。

您可以使用 groupbyapply 方法。

代码:

import pandas as pd
import scipy.interpolate 

# Create a sample dataframe
df = pd.DataFrame({'year': [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011], 'bfsId': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2], 'income': [15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000], 'taxes_perc': [0.74, 1.715, 3.192, 4.09, 0.813333333, 1.905, 3.548, 4.543333333, 0.753333333, 1.775, 3.308, 4.183333333, 0.813333333, 1.94, 3.608, 4.563333333], 'perc_inc': [17375, 23625, 33875, 33875, 17375, 23625, 33875, 33875, 17375, 23500, 33625, 33625, 17375, 23500, 33625, 33625] })

# Ensure that the dataframe is sorted by the group keys
df = df.sort_values(['year', 'bfsId'])

# Define the interpolating function
def f(df):
    y_interp = scipy.interpolate.interp1d(df.income, df.taxes_perc, fill_value="extrapolate")
    return y_interp(df.perc_inc)

# Carry out the interpolation on each group
df['tax_rate_interpol'] = df.groupby(['year', 'bfsId']).apply(f).apply(pd.Series).stack().values

print(df)

输出:

year bfsId income taxes_perc perc_inc tax_rate_interpol
0 2010 1 15000 0.74 17375 1.20312
1 2010 1 20000 1.715 23625 2.78582
2 2010 1 25000 3.192 33875 4.78595
3 2010 1 30000 4.09 33875 4.78595
4 2010 2 15000 0.813333 17375 1.33187
5 2010 2 20000 1.905 23625 3.09618
6 2010 2 25000 3.548 33875 5.31472
7 2010 2 30000 4.54333 33875 5.31472
8 2011 1 15000 0.753333 17375 1.23862
9 2011 1 20000 1.775 23500 2.8481
10 2011 1 25000 3.308 33625 4.81795
11 2011 1 30000 4.18333 33625 4.81795
12 2011 2 15000 0.813333 17375 1.3485
13 2011 2 20000 1.94 23500 3.1076
14 2011 2 25000 3.608 33625 5.25595
15 2011 2 30000 4.56333 33625 5.25595