How to broadcast and assign a series of values across all columns in a Pandas dataframe?
I know this must be simple, but I can't figure it out and can't find an existing answer on it...
Say I have this dataframe...
>>> import pandas as pd
>>> import numpy as np
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.nan, index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 NaN NaN NaN NaN
2013-01-04 NaN NaN NaN NaN
2013-01-05 NaN NaN NaN NaN
2013-01-06 NaN NaN NaN NaN
Setting the values of one column from a series is easy...
>>> df.loc[:, 'A'] = pd.Series([1,2,3,4,5,6], index=dates)
>>> df
A B C D
2013-01-01 1 NaN NaN NaN
2013-01-02 2 NaN NaN NaN
2013-01-03 3 NaN NaN NaN
2013-01-04 4 NaN NaN NaN
2013-01-05 5 NaN NaN NaN
2013-01-06 6 NaN NaN NaN
But how do I set the values of all columns using broadcasting?
>>> default_values = pd.Series([1,2,3,4,5,6], index=dates)
>>> df.loc[:, :] = default_values
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/billtubbs/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/Users/billtubbs/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexing.py", line 651, in _setitem_with_indexer
value=value)
File "/Users/billtubbs/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py", line 3693, in setitem
return self.apply('setitem', **kwargs)
File "/Users/billtubbs/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py", line 3581, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/billtubbs/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py", line 940, in setitem
values[indexer] = value
ValueError: could not broadcast input array from shape (6) into shape (6,4)
Other than these ways:
>>> for s in df:
... df.loc[:, s] = default_values
...
or:
>>> df.loc[:, :] = np.vstack([default_values]*4).T
UPDATE:
or:
>>> df.loc[:, :] = default_values.values.reshape(6,1)
You can work around this with NumPy:
nvalues = 6
ncolumns = 4
# start at 1 so the values match the 1..6 used in the question
default_values = np.repeat(np.arange(1, nvalues + 1), ncolumns).reshape(nvalues, ncolumns)
df.loc[:, :] = default_values
However, this does not address your wish to do the broadcasting on the pandas side. I am not aware of any trick for that.
Using NumPy broadcasting
s = pd.Series([1,2,3,4,5,6], index=dates)
df.loc[:,:] = s.values[:,None]
Using index matching
df.loc[:] = pd.concat([s]*df.columns.size, axis=1)
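The NumPy broadcasting variant above works because NumPy matches shapes starting from the last axis: a shape of (6,) is compared against the 4 columns and fails, while (6, 1) stretches across them. A minimal sketch of that rule follows. It builds a new frame instead of assigning in place, and the names `column`, `values`, `filled` and `broadcast_1d_ok` are illustrative, not from the answers above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.nan, index=dates, columns=list("ABCD"))
s = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# A 1-D array of shape (6,) is matched against the LAST axis of (6, 4),
# so broadcasting fails, just like the error in the question.
try:
    np.broadcast_to(s.to_numpy(), df.shape)
    broadcast_1d_ok = True
except ValueError:
    broadcast_1d_ok = False

# A trailing axis gives shape (6, 1), which stretches across the 4 columns.
column = s.to_numpy()[:, None]
values = np.broadcast_to(column, df.shape).copy()  # copy: broadcast views are read-only

# Build a new frame from the broadcast values, reusing the original labels.
filled = pd.DataFrame(values, index=df.index, columns=df.columns)
```

Note that this path ignores the Series index entirely: the values are placed by position, so it is only safe when `s` is already in the same order as `df.index`.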
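The most straightforward way is already provided in pandas: call the .add method and specify the direction (axis) along which the new values should be added.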
In [7]: df.fillna(0).add(default_values, axis=0)
Out[7]:
A B C D
2013-01-01 1.0 1.0 1.0 1.0
2013-01-02 2.0 2.0 2.0 2.0
2013-01-03 3.0 3.0 3.0 3.0
2013-01-04 4.0 4.0 4.0 4.0
2013-01-05 5.0 5.0 5.0 5.0
2013-01-06 6.0 6.0 6.0 6.0
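Note: in newer pandas versions you may be able to run df.add(default_values, axis=0, fill_value=0) instead, which is basically a syntax improvement that avoids chaining methods. Check this on your installed version, since some pandas releases raise NotImplementedError when fill_value is passed together with a Series.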
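One thing worth keeping in mind with the `.add` approach is that it adds rather than assigns, and it returns a new frame instead of modifying `df`. The sketch below, which assumes a frame where column A already holds a value (an illustrative setup, not part of the question), shows the difference.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
default_values = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

df = pd.DataFrame(np.nan, index=dates, columns=list("ABCD"))
df["A"] = 10.0  # pretend column A already contains data

# NaN cells become 0 and then receive the default; existing cells are summed.
result = df.fillna(0).add(default_values, axis=0)

# The original frame is untouched: .add returns a new object.
still_empty = df["B"].isna().all()
```

So this is equivalent to a broadcast assignment only when the target frame is entirely NaN (or zero) to begin with.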
Note that pandas' index alignment applies here: consider this case, where the new values cover only 4 of the 5 rows of the target dataframe
In [37]: default_values = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
In [38]: df = pd.DataFrame(np.ones(shape=(5,5)) + np.nan, index=['a', 'b', 'c', 'd', 'e'])
In [39]: df.fillna(0).add(default_values, axis=0)
Out[39]:
0 1 2 3 4
a 1.0 1.0 1.0 1.0 1.0
b 2.0 2.0 2.0 2.0 2.0
c 3.0 3.0 3.0 3.0 3.0
d 4.0 4.0 4.0 4.0 4.0
e NaN NaN NaN NaN NaN
Row e, which is not found in the series of new values, becomes NaN.
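If silent NaN rows are a concern, the labels that the new values do not cover can be listed before adding. This is a small sketch using `Index.difference`; the variable names `missing` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

default_values = pd.Series([1, 2, 3, 4], index=list("abcd"))
df = pd.DataFrame(np.nan, index=list("abcde"), columns=range(5))

# Row labels present in the frame but absent from the new values.
missing = df.index.difference(default_values.index)

# Alignment on the row index leaves those rows as NaN in the result.
result = df.fillna(0).add(default_values, axis=0)
```

Here `missing` contains only the label `e`, which matches the all-NaN row in the output above.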
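I came here looking for a solution that both creates new columns and assigns one default value per column (rather than per row). Although this is not what the OP asked for, I found that this solution works well. If appropriate, please comment on this and redirect to a more specific thread: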
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.nan, index=dates, columns=list('ABCD'))
default_values = pd.Series([1,2,3,4], index=['A','B','C','D'] ).to_dict()
df = df.assign( **default_values ) # note use of ** notation (kwargs)
In [97]: df
Out[97]:
A B C D
2013-01-01 1 2 3 4
2013-01-02 1 2 3 4
2013-01-03 1 2 3 4
2013-01-04 1 2 3 4
2013-01-05 1 2 3 4
2013-01-06 1 2 3 4
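The same `assign` pattern can also cover the per-row case from the original question, since `assign` accepts a Series for each column and aligns it on the index. A minimal sketch, assuming string column names that are valid keyword arguments (the names `per_row` and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.nan, index=dates, columns=list("ABCD"))
per_row = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# One keyword argument per column, each receiving the same index-aligned Series.
filled = df.assign(**{col: per_row for col in df.columns})
```

Like the example above, this returns a new frame and leaves `df` unchanged.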