Why does numpy.dot behave differently from numpy.matmul when a DataFrame and an array are passed as operands?
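Hello fellow code enthusiasts,
To keep it brief: train_features is a DataFrame of, say, 15435 rows × 56 columns, and weights_input_to_hidden is a NumPy array of shape, say, (56, 8). Why do these two pieces of code behave so differently?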
hidden_inputs = np.matmul(train_features, weights_input_to_hidden)
ValueError: Shape of passed values is (15435, 8), indices imply (15435, 56)
whereas
hidden_inputs = np.dot(train_features, weights_input_to_hidden)
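produces a (15435, 8) array, as expected!
I know how to make np.matmul work by passing dataframe.values, but I am trying to understand the reason. I guess my broader question is: isn't it safer to always use numpy.dot? Does it have any drawbacks?
Thanks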
In [331]: df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD')); arr = np.random.normal(loc=0.0, scale=1, size=(4, 8))
In [332]: df
Out[332]:
A B C D
0 89 32 48 98
1 30 82 4 80
2 33 19 37 5
3 52 54 85 59
...
14 56 12 20 95
In [333]: arr
Out[333]:
array([[-1.07529431, -1.81065918, 1.07124769, -1.48496446, 0.30771816,
0.44839516, 0.59052604, 0.13464351],
[ 1.02452572, 0.69012042, 0.93467643, -0.76834263, 0.64154991,
0.8636691 , -0.953216 , 0.39123113],
[-1.07415575, 1.24940955, -0.97997359, 0.64934175, 1.65194464,
0.37485754, 0.61134192, 2.34500983],
[-0.69994102, 0.56102524, 0.14815698, -0.07709728, -0.27005677,
-0.15430032, 0.52354464, 0.42093538]])
dot:
In [334]: np.dot(df,arr)
Out[334]:
array([[-183.07006635, -24.11268158, 92.73134126, -133.1359306 ,
100.74429239, 70.4163106 , 102.70569257, 178.3148074 ],
[ -8.53962464, 52.14975688, 116.71356194, -111.12344468,
46.84187465, 73.42812559, -16.11899202, 79.17512792],
[ -59.26219127, 2.39381449, 17.59178789, -39.96217878,
82.11581512, 44.30498042, 26.61362943, 100.74666767],
...
32.31431554, 28.31277828, 83.59444567, 99.12386781]])
In [335]: _.shape
Out[335]: (15, 8)
matmul, with the full traceback:
In [336]: np.matmul(df,arr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1653 blocks = [
-> 1654 make_block(values=blocks[0], placement=slice(0, len(axes[0])))
1655 ]
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype)
3046
-> 3047 return klass(values, ndim=ndim, placement=placement)
3048
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
124 raise ValueError(
--> 125 f"Wrong number of items passed {len(self.values)}, "
126 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 8, placement implies 4
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-336-efb905aa1e5d> in <module>
----> 1 np.matmul(df,arr)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __array_wrap__(self, result, context)
1916 return result
1917 d = self._construct_axes_dict(self._AXIS_ORDERS, copy=False)
-> 1918 return self._constructor(result, **d).__finalize__(self)
1919
1920 # ideally we would define this to avoid the getattr checks, but
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
462 mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
463 else:
--> 464 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
465
466 # For data is list-like, or Iterable (will consume into list)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
208 block_values = [values]
209
--> 210 return create_block_manager_from_blocks(block_values, [columns, index])
211
212
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1662 blocks = [getattr(b, "values", b) for b in blocks]
1663 tot_items = sum(b.shape[0] for b in blocks)
-> 1664 construction_error(tot_items, blocks[0].shape[1:], axes, e)
1665
1666
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
1692 if block_shape[0] == 0:
1693 raise ValueError("Empty data passed with indices specified.")
-> 1694 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
1695
1696
ValueError: Shape of passed values is (15, 8), indices imply (15, 4)
As I suspected, matmul has delegated part of the work to pandas (note the call to __array_wrap__ in the traceback), as discussed in this pandas issue:
https://github.com/pandas-dev/pandas/issues/26650
They say that pandas has its own __matmul__, so the operator version works:
In [337]: df@arr
Out[337]:
0 1 2 3 4 5 6 7
0 -183.070066 -24.112682 92.731341 -133.135931 100.744292 70.416311 102.705693 178.314807
1 -8.539625 52.149757 116.713562 -111.123445 46.841875 73.428126 -16.118992 79.175128
2 -59.262191 2.393814 17.591788 -39.962179 82.115815 44.304980 26.613629 100.746668
3 -133.190674 82.412526 31.620913 -68.063345 175.126984 92.713852 62.086887 252.288966
4 -117.745733 8.202933 15.816666 -53.176516 124.940168 59.727122 61.493886 168.665106
....
14 -135.899685 -14.829880 65.681429 -86.715528 32.314316 28.312778 83.594446 99.123868
In [338]: _.shape
Out[338]: (15, 8)
This calls df.__matmul__(arr). Note that the result is a DataFrame, not an array.
In short, use the @ operator instead of np.matmul. Or do the DataFrame-to-array conversion yourself:
np.matmul(df.to_numpy(),arr)
Some timings:
In [343]: timeit np.matmul(df.to_numpy(),arr)
24.9 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [344]: timeit np.dot(df,arr)
50 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [345]: timeit df.__matmul__(arr)
180 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [346]: timeit np.dot(df.to_numpy(),arr)
21.3 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you are only interested in an array result, converting the DataFrame yourself is a good idea.
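To make the recommendation concrete, here is a minimal sketch of the two routes discussed above: converting the DataFrame explicitly before calling NumPy, and using the @ operator. The seeded generator and the variable names (rng, values, via_matmul, via_dot, via_operator) are my own choices for illustration, not part of the original session, and exact behavior of np.matmul on a raw DataFrame may differ between pandas/NumPy versions, so it is deliberately left out here.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same shapes as in the session above
# (the seed and generator are assumptions made for reproducibility).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(15, 4)), columns=list("ABCD"))
arr = rng.normal(loc=0.0, scale=1.0, size=(4, 8))

# Route 1: convert explicitly, so NumPy only ever sees plain ndarrays.
values = df.to_numpy()
via_matmul = np.matmul(values, arr)   # ndarray, shape (15, 8)
via_dot = np.dot(values, arr)         # ndarray, shape (15, 8)

# Route 2: the @ operator dispatches to DataFrame.__matmul__,
# so the result is a DataFrame that keeps the row index of df.
via_operator = df @ arr

print(type(via_matmul).__name__, via_matmul.shape)
print(type(via_operator).__name__, via_operator.shape)
```

With the explicit conversion, np.matmul and np.dot give the same 2-D product, so the choice between them matters mainly for higher-dimensional inputs, where their broadcasting rules differ.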