More efficient way to mean center a sub-set of columns in a pandas dataframe and retain column names

Question

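I have a dataframe with roughly 370 columns. I'm testing a series of hypotheses, each of which requires fitting a cubic regression model on a subset of the columns. I'm planning to use statsmodels to model this data.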

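Part of the process for polynomial regression involves mean centering the variables (subtracting a feature's mean from every case for that feature).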
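Written out, with notation assumed here for illustration ($x_{ij}$ is the value of feature $j$ for case $i$, and $n$ is the number of cases), the centering step is:

```latex
\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij},
\qquad
x^{c}_{ij} = x_{ij} - \bar{x}_j
```

Each centered column therefore has mean zero, while its spread is unchanged.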

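I can do this with 3 lines of code, but it seems inefficient, given that I need to repeat the process for half a dozen hypotheses. Keep in mind that I need the data at the coefficient level from the statsmodels output, so I need to retain the column names.

Here's a peek at the data. It's the sub-set of columns I need for one of my hypothesis tests.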

      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53

Here's the code that mean centers the data and retains the column names.

import pandas as pd
from sklearn import preprocessing

# create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

# center the variables (the flags must be booleans, not the strings 'True'/'False')
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)

# convert back into a pandas DataFrame and add column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)

Any pointers on how to make this more efficient / faster would be awesome!

Answer 1

df.apply(lambda x: x-x.mean())

%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop

Both yield:

        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635
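For the half-dozen hypotheses mentioned in the question, the same subtraction can be wrapped in a small helper and applied to each column subset. This is only a sketch: the helper name `center_columns`, the `extra` column, and the `h1` column list are hypothetical and stand in for the real 370-column frame and the real hypotheses.

```python
import pandas as pd


def center_columns(df, columns):
    """Return a new DataFrame holding only `columns`, each mean-centered."""
    subset = df[list(columns)]
    # Subtracting a Series of column means aligns on column labels,
    # so the column names and the row index are preserved.
    return subset - subset.mean()


# Sample rows from the question, plus one hypothetical unrelated column.
df = pd.DataFrame({
    "i": [0.51, 1.24, 0.00, 0.00],
    "we": [0, 0, 0, 0],
    "you": [0, 0, 0, 0],
    "shehe": [0.26, 0.00, 0.00, 0.00],
    "they": [0.00, 0.00, 0.72, 0.00],
    "ipron": [1.02, 1.66, 1.45, 0.53],
    "extra": [1.0, 2.0, 3.0, 4.0],  # hypothetical column, not in the question
})

# Hypothetical mapping from hypothesis name to the columns it needs.
hypotheses = {
    "h1": ["i", "we", "you"],
    "h2": ["i", "we", "you", "shehe", "they", "ipron"],
}

# One centered frame per hypothesis, each keeping its column names.
centered = {name: center_columns(df, cols) for name, cols in hypotheses.items()}
print(centered["h2"])
```

Each entry of `centered` is a regular DataFrame, so it can be passed on to the model-fitting step with its column names intact.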

Answer 2

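I know this question is a little old, but as of now scikit-learn is the fastest solution in my timings. Plus, you can condense the code into one line: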

pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)

%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)
684 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
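One detail worth checking with the one-liner: `preprocessing.scale` returns a plain NumPy array, so a DataFrame rebuilt with only `columns=` gets a fresh default index. A minimal sketch, assuming a small made-up frame with a non-default index, that also passes `index=df.index` and confirms the result matches the pandas subtraction:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical data with a non-default index, for illustration only.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 6.0], "b": [10.0, 20.0, 60.0]},
    index=["r1", "r2", "r3"],
)

# pandas route: keeps both column names and index automatically.
via_pandas = df.subtract(df.mean())

# scikit-learn route: rebuild the frame, restoring columns and index explicitly.
via_sklearn = pd.DataFrame(
    preprocessing.scale(df, with_mean=True, with_std=False),
    columns=df.columns,
    index=df.index,
)

# Both routes should agree up to floating-point tolerance.
same = np.allclose(via_pandas.to_numpy(), via_sklearn.to_numpy())
print(same)
```

If the original frame already has a default integer index, as in the question's sample, omitting `index=` makes no visible difference.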

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1.63 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The df I used for testing:

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
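The two answers report different timings for `df.subtract(df.mean())`, which suggests the ranking depends on the data shape, library versions, and machine. A rough sketch for re-running the comparison outside IPython with the standard `timeit` module; the seed, the repeat counts, and the `candidates` dictionary are arbitrary choices made here, not part of either answer:

```python
import timeit

import numpy as np
import pandas as pd
from sklearn import preprocessing

# Test frame of the same shape as above; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(low=1, high=10, size=(20, 5)), columns=list("abcde"))

# The three approaches from the answers, wrapped as zero-argument callables.
candidates = {
    "apply": lambda: df.apply(lambda x: x - x.mean()),
    "subtract": lambda: df.subtract(df.mean()),
    "sklearn": lambda: pd.DataFrame(
        preprocessing.scale(df, with_mean=True, with_std=False),
        columns=df.columns,
    ),
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats of 200 calls each, reported per call in microseconds.
    best = min(timeit.repeat(func, number=200, repeat=5))
    timings[name] = best / 200 * 1e6
    print(f"{name}: {timings[name]:.1f} µs per call")
```

On a frame closer to the real use case (many more rows or columns) the ordering may change, so it is worth timing on representative data.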

Tags: python, machine-learning, pandas, scikit-learn, statsmodels