What is the recommended way to compute a weighted sum of selected columns of a pandas dataframe?
For example, I would like to compute the weighted sum of columns 'a' and 'c' of the matrix below, with the weights defined in the dictionary w.
df = pd.DataFrame({'a': [1,2,3],
'b': [10,20,30],
'c': [100,200,300],
'd': [1000,2000,3000]})
w = {'a': 1000., 'c': 10.}
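I came up with a few options myself (see below), but they all look somewhat complicated. Is there no direct pandas operation for this basic use case, something like df.wsum(w)?
I tried pd.DataFrame.dot, but it raises a ValueError: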
df.dot(pd.Series(w))
# This raises an exception:
# "ValueError: matrices are not aligned"
The exception can be avoided by specifying a weight for every column, but that is not what I want.
w = {'a': 1000., 'b': 0., 'c': 10., 'd': 0. }
df.dot(pd.Series(w)) # This works
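The exception comes from label alignment: DataFrame.dot expects the columns of the frame and the index of the Series to hold the same labels. A minimal sketch that reproduces the failure with the data from the question (the variable names `aligned` and `message` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30],
                   'c': [100, 200, 300],
                   'd': [1000, 2000, 3000]})
w = {'a': 1000., 'c': 10.}

# The Series index is ['a', 'c'] while the frame has columns a, b, c, d,
# so the labels do not match and dot() refuses to multiply.
try:
    df.dot(pd.Series(w))
    aligned = True
    message = ""
except ValueError as err:
    aligned = False
    message = str(err)

print(aligned, message)
```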
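How can the dot product be computed on only a subset of the columns? Alternatively, one could select the columns of interest before applying the dot operation, or exploit the fact that pandas/numpy ignore NaNs when computing (row-wise) sums (see below).
Here are the three approaches I found myself: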
w = {'a': 1000., 'c': 10.}
# 1) Create a complete lookup W.
W = { c: 0. for c in df.columns }
W.update(w)
ret = df.dot(pd.Series(W))
# 2) Select columns of interest before applying the dot product.
ret = df[list(w.keys())].dot(pd.Series(w))
# 3) Exploit the handling of NaNs when computing the (row-wise) sum
ret = (df * pd.Series(w)).sum(axis=1)
# (df * pd.Series(w)) contains columns full of nans
Did I miss an option?
Here is an option that does not require creating a pd.Series:
(df.loc[:,w.keys()] * list(w.values())).sum(axis=1)
0 2000.0
1 4000.0
2 6000.0
You can use a Series as in your first example; just reindex it afterwards:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],
'b': [10,20,30],
'c': [100,200,300],
'd': [1000,2000,3000]})
w = {'a': 1000., 'c': 10.}
print(df.dot(pd.Series(w).reindex(df.columns, fill_value=0)))
Output
0 2000.0
1 4000.0
2 6000.0
dtype: float64
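If this is needed in several places, the reindex approach can be wrapped in a small helper. The sketch below is only an illustration: the name `weighted_sum` and the KeyError check are assumptions, not part of pandas. The check is there because reindex silently drops weights whose key is not a column.

```python
import pandas as pd

def weighted_sum(df, weights):
    """Row-wise weighted sum; columns missing from `weights` get weight 0."""
    unknown = set(weights) - set(df.columns)
    if unknown:
        # reindex would silently discard these weights, so fail loudly instead
        raise KeyError(f"weights refer to unknown columns: {sorted(unknown)}")
    full = pd.Series(weights, dtype=float).reindex(df.columns, fill_value=0.0)
    return df.dot(full)

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30],
                   'c': [100, 200, 300],
                   'd': [1000, 2000, 3000]})
result = weighted_sum(df, {'a': 1000., 'c': 10.})
print(result)
```

Note that a NaN in a column with weight 0 still turns the row result into NaN, because NaN * 0 is NaN.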
Using numpy dot on the values:
df[list(w.keys())].values.dot(list(w.values()))
array([2000., 4000., 6000.])
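The numpy route returns a plain array, so the row index of the frame is lost. If the index matters, the result can be wrapped back into a Series. A small sketch; the custom index 'x', 'y', 'z' is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30],
                   'c': [100, 200, 300],
                   'd': [1000, 2000, 3000]},
                  index=['x', 'y', 'z'])
w = {'a': 1000., 'c': 10.}

# Select the weighted columns in the order of the dict keys,
# multiply as raw arrays, then restore the original row index.
weights = np.array(list(w.values()), dtype=float)
values = df[list(w)].to_numpy() @ weights
result = pd.Series(values, index=df.index)
print(result)
```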
Fixing your error:
df.mul(pd.Series(w), axis=1).sum(axis=1)
0 2000.0
1 4000.0
2 6000.0
dtype: float64
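One thing to keep in mind with the mul/sum variant: it works because sum skips NaN by default, so a genuine NaN in a weighted column is also treated as 0, whereas a dot product propagates it. A sketch with a slightly modified frame (the NaN in column 'a' is added for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [10.0, 20.0, 30.0],
                   'c': [100.0, 200.0, 300.0]})
w = pd.Series({'a': 1000., 'c': 10.})

# Column 'b' has no weight, so it becomes all NaN after alignment.
product = df.mul(w, axis=1)

# sum() skips NaN: the missing value in row 1 silently counts as 0.
skipping = product.sum(axis=1)

# dot() on the selected columns propagates the NaN instead.
strict = df[w.index].dot(w)

print(skipping)
print(strict)
```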
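I stumbled upon my own question again and benchmarked the available answers.
Observation: it pays off to first pad the incomplete weight vector with zeros, rather than first taking a view on the columns and then computing the dot product on the resulting sub-frame.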
import pandas as pd
import numpy as np
def benchmark(n_rows, n_cols, n_ws):
    print("n_rows:%d, n_cols:%d, n_ws:%d" % (n_rows, n_cols, n_ws))
    df = pd.DataFrame(np.random.randn(n_rows, n_cols),
                      columns=range(n_cols))
    w = dict(zip(np.random.choice(np.arange(n_cols), n_ws),
                 np.random.randn(n_ws)))
    w0 = pd.Series(w).reindex(df.columns, fill_value=0).values
    # Method 0 (aligned vector w0, reference!)
    def fun0(df, w0): return df.values.dot(w0)
    # Method 1 (reindex)
    def fun1(df, w): return df.dot(pd.Series(w).reindex(df.columns, fill_value=0))
    # Method 2 (column view)
    def fun2(df, w): return (df.loc[:,w.keys()] * list(w.values())).sum(axis=1)
    # Method 3 (column view, faster)
    def fun3(df, w): return df.loc[:, w].dot(pd.Series(w))
    # Method 4 (column view, numpy)
    def fun4(df, w): return df[list(w.keys())].values.dot(list(w.values()))
    # Assert equivalence
    np.testing.assert_array_almost_equal(fun0(df,w0), fun1(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun2(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun3(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun4(df,w), decimal=10)
    print("fun0:", end=" ")
    %timeit fun0(df, w0)
    print("fun1:", end=" ")
    %timeit fun1(df, w)
    print("fun2:", end=" ")
    %timeit fun2(df, w)
    print("fun3:", end=" ")
    %timeit fun3(df, w)
    print("fun4:", end=" ")
    %timeit fun4(df, w)
benchmark(n_rows = 200000, n_cols = 11, n_ws = 3)
benchmark(n_rows = 200000, n_cols = 11, n_ws = 9)
benchmark(n_rows = 200000, n_cols = 31, n_ws = 5)
Output (fun0() is the reference using the zero-padded vector w0):
n_rows:200000, n_cols:11, n_ws:3
fun1: 1.98 ms ± 86.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun2: 9.66 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun3: 2.68 ms ± 90.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 2.2 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n_rows:200000, n_cols:11, n_ws:9
fun1: 1.85 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
fun2: 11.7 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun3: 3.7 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 3.17 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n_rows:200000, n_cols:31, n_ws:5
fun1: 3.08 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun2: 13.1 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
fun3: 5.48 ms ± 57 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 4.98 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I tested this with pandas 1.2.3, numpy 1.20.1 and Python 3.9.0 on a MacBook Pro (late 2015). Similar results hold for older Python versions.
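The benchmark above relies on the %timeit magic, which is only available in IPython/Jupyter. For a plain Python script, a rough equivalent could use the standard timeit module. This is a reduced sketch that compares only the reindex and the column-selection variants; the function names, sizes and repeat counts are arbitrary choices for illustration. It also draws the weighted columns without replacement, since np.random.choice samples with replacement by default and the dict may then hold fewer than n_ws weights.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols, n_ws = 20_000, 11, 3

df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=range(n_cols))
keys = rng.choice(n_cols, size=n_ws, replace=False)  # distinct column labels
w = dict(zip(keys.tolist(), rng.standard_normal(n_ws).tolist()))

def by_reindex(df, w):
    # Pad the weight vector with zeros, then one dot product over all columns.
    return df.dot(pd.Series(w).reindex(df.columns, fill_value=0))

def by_selection(df, w):
    # Select the weighted columns first, then take the dot product.
    return df[list(w)].dot(pd.Series(w))

candidates = {"by_reindex": by_reindex, "by_selection": by_selection}
number = 20
timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats, reported as seconds per call.
    best = min(timeit.repeat(lambda: fn(df, w), number=number, repeat=3))
    timings[name] = best / number

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Absolute numbers will differ from the figures above, since they depend on the machine and on the pandas/numpy versions.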