pandas numpy : setting an array element with a sequence while math operation

Question

I have a DataFrame named df4; you can get it with the following code:

import numpy as np
import pandas as pd
from io import StringIO

df4s = """
contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

"""

df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"RB": int, "BeginDate": int, "EndDate": int, "ValIssueDate": int, "Valindex0": int})

The output will be:

contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

I am trying to build a new column with the logic below; the new column's values are based on the values of existing columns:

def test(RB):
    n=1
    for i in np.arange(RB,50):
        n = n * df4[str(i)].values
    return  n


vfunc=np.vectorize(test)
df4['n']=vfunc(df4['RB'].values)

Then I get this error:

    res = array(outputs, copy=False, subok=True, dtype=otypes[0])

ValueError: setting an array element with a sequence.

Answer 1

Recreating the dataframe (thanks for using the StringIO approach):

In [82]: df4['RB'].values
Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
In [83]: test(46)
Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
In [84]: test(50)
Out[84]: 1
In [85]: [test(i) for i in df4['RB'].values]
Out[85]: 
[array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 1,
 1,
 1]
In [86]: vfunc=np.vectorize(test)
In [87]: vfunc(df4['RB'].values)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
    vfunc(df4['RB'].values)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
    res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence.

Note the full traceback. vectorize has a problem creating the return array from this mix of array sizes. It 'guessed', based on a trial calculation, that it should return an int dtype.
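The failure can be reproduced without the dataframe. The sketch below is only an illustration: `ragged` is a hypothetical stand-in for `test`, returning an array for some inputs and a scalar for others, which is the same kind of mix shown in Out[85].

```python
import numpy as np

def ragged(x):
    # Hypothetical stand-in for test(): an array for small x, a plain scalar otherwise
    return np.arange(3) if x < 2 else 1

vfunc = np.vectorize(ragged)

# vectorize infers an int dtype from a trial call on the first element,
# then cannot fit whole arrays into single int slots
try:
    vfunc(np.array([0, 1, 2]))
    error = None
except (ValueError, TypeError) as exc:
    error = exc

# An object-dtype array can hold each result as-is, whatever its shape
results = [ragged(x) for x in (0, 1, 2)]
boxed = np.empty(len(results), dtype=object)
for i, r in enumerate(results):
    boxed[i] = r
```

Filling a preallocated object array element by element avoids any attempt by NumPy to merge the results into one regular block.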

If we tell it to return an object dtype array, we get:

In [88]: vfunc=np.vectorize(test, otypes=['object'])
In [89]: vfunc(df4['RB'].values)
Out[89]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)

We can assign that to a df column:

In [90]: df4['n']=_
In [91]: df4
Out[91]: 
   contract  RB  BeginDate  ...  49  50                                     n
2    A00118  46   19850100  ...   7   7  [42, 42, 42, 42, 42, 42, 42, 42, 42]
3    A00118  47   19000100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
5    A00118  47   19850100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
6    A00253  48   19000100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
7    A00253  48   19820100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
8    A00253  48   19850100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
9    A00253  50   19000100  ...   7   7                                     1
10   A00253  50   19790100  ...   7   7                                     1
11   A00253  50   19850100  ...   7   7                                     1

We can also assign the Out[85] list:

df4['n']=Out[85]

Timings are about the same:

In [94]: timeit vfunc(df4['RB'].values)
211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: timeit [test(i) for i in df4['RB'].values]
217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

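Normally vectorize is slower, but test itself is probably slow enough that the iteration method makes little difference. Keep in mind (reread the docs if necessary) that vectorize is not a performance tool. It does not 'compile' your function or make it run faster.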
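If the intent was one number per row (the product of that row's own columns from RB up to 49), whole-array operations avoid the Python-level loop altogether. That intent is an assumption on my part, not something the question states, and the small frame below is a stand-in with the same column layout as df4. The sketch also assumes every RB value is 46 or greater.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same layout as df4 (values copied from the question)
df = pd.DataFrame({
    "RB": [46, 47, 48, 50],
    "48": [1, 1, 1, 1],
    "46": [2, 2, 2, 2],
    "47": [3, 3, 3, 3],
    "49": [7, 7, 7, 7],
    "50": [7, 7, 7, 7],
})

# Columns 46..49; 50 is excluded, matching np.arange(RB, 50) in test()
col_numbers = np.arange(46, 50)
values = df[[str(c) for c in col_numbers]].to_numpy()

# keep[i, j] is True when column j takes part in row i's product
keep = col_numbers >= df["RB"].to_numpy()[:, None]

# Replace excluded entries with 1 (the multiplicative identity),
# then multiply across each row
df["n"] = np.where(keep, values, 1).prod(axis=1)
```

With this data the new column holds 42, 21, 7 and 1, the same numbers that test produces for RB values of 46, 47, 48 and 50, but as one scalar per row instead of a repeated array.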

An alternative way of returning an object dtype array:

In [96]: vfunc=np.frompyfunc(test,1,1)
In [97]: vfunc(df4['RB'].values)
Out[97]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
In [98]: timeit vfunc(df4['RB'].values)
202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
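As a smaller illustration of how frompyfunc behaves, the sketch below uses a toy function (`square` is a made-up name, not part of the question). The two integer arguments are the number of inputs and outputs, and the result is always object dtype, so no dtype guess is involved.

```python
import numpy as np

def square(x):
    # Toy function that returns one scalar per input element
    return x * x

# 1 input argument, 1 output
usquare = np.frompyfunc(square, 1, 1)

out = usquare(np.array([1, 2, 3]))   # object-dtype array
as_int = out.astype(int)             # convert once every element is known to be a scalar
```

The astype step only works when each element is a scalar; with the mixed results from test it would raise the same kind of error as before.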

python

numpy

dataframe

pandas

numpy-ndarray