Vectorizing a series of CDF samples in Python with NumPy

Question

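I am writing a basic financial program in Python in which daily expenses are read in as a table, converted to a PDF (probability density function), and finally to a CDF (cumulative distribution function) ranging from 0 to 1, using NumPy's built-in histogram functionality. I am trying to randomly sample a daily expense by comparing a random number between 0 and 1 against the CDF array and the array of CDF bin center points, using SciPy's interp1d to compute the interpolated value. I have implemented this algorithm successfully with a for loop, but it is far too slow, and I am trying to convert it to a vectorized form. I have included a code sample that works with the for loop, along with my attempt so far at vectorizing the algorithm. I would appreciate any suggestions on how to get the vectorized version working and speed up the code.

Sample input file: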

12.00    March 01, 2014
0.00     March 02, 2014
0.00     March 03, 2014
0.00     March 04, 2014
0.00     March 05, 2014
0.00     March 06, 2014
44.50    March 07, 2014
0.00     March 08, 2014
346.55   March 09, 2014
168.18   March 10, 2014
140.82   March 11, 2014
10.83    March 12, 2014
0.00     March 13, 2014
0.00     March 14, 2014
174.00   March 15, 2014
0.00     March 16, 2014
0.00     March 17, 2014
266.53   March 18, 2014
0.00     March 19, 2014
110.00   March 20, 2014
0.00     March 21, 2014
0.00     March 22, 2014
44.50    March 23, 2014

Code for the for-loop version (works, but is too slow):

#!/usr/bin/python
import pandas as pd
import numpy as np
import random
import itertools
import scipy.interpolate

def Linear_Interpolation(rand,Array,Array_Center):
    if(rand < Array[0]):
        y_interp = scipy.interpolate.interp1d((0,Array[0]),(0,Array_Center[0]))
    else:
        y_interp = scipy.interpolate.interp1d(Array,Array_Center)

    final_value = y_interp(rand)
    return (final_value)

#--------- Main Program --------------------
# - Reads the file in and transforms the first column of float variables into
#   an array titled MISC_DATA
File1 = '../../Input_Files/Histograms/Static/Misc.txt'
MISC_DATA = pd.read_table(File1,header=None,names = ['expense','month','day','year'],sep = r'\s+')

# Creates the PDF bin heights and edges
Misc_hist, Misc_bin_edges = np.histogram(MISC_DATA['expense'],bins=60,density=True)
# Creates the CDF bin heights
Misc = np.cumsum(Misc_hist*np.diff(Misc_bin_edges))
# Creates an array of the bin center points along the x axis
Misc_Center = (Misc_bin_edges[:-1] + Misc_bin_edges[1:])/2

iterator = range(0,100)
for cycle in iterator:
    MISC_EXPENSE = Linear_Interpolation(random.random(),Misc,Misc_Center)
    print(MISC_EXPENSE)

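I tried to vectorize the for loop as shown below, turning the variable MISC_EXPENSE from a scalar into an array, but it does not work. It tells me that the truth value of an array with more than one element is ambiguous. I think this refers to the random variable array 'rand_var' having different dimensions from the arrays 'Misc' and 'Misc_Center'. Any suggestions are appreciated.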

rand_var = np.random.rand(100)
MISC_EXPENSE = Linear_Interpolation(rand_var,Misc,Misc_Center)

Answer 1

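If I understand your example correctly, the code creates a new interpolation object for every random number, which is slow. However, interp1d can take a whole vector of values to interpolate at. In any case, I assume the starting zero should be part of the CDF: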

y_interp = scipy.interpolate.interp1d(
    np.concatenate((np.array([0]), Misc)),
    np.concatenate((np.array([0]), Misc_Center))
)


new_vals = y_interp(np.random.rand(100))
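
As a self-contained illustration of the same idea, here is a minimal sketch that builds the CDF from the sample expenses in the question and draws 100 values in one call. It is an assumption-laden variant rather than the code above: it swaps in np.interp for interp1d (both interpolate linearly; np.interp clamps values outside the CDF range to the end points instead of raising an error), and the names expenses, cdf_x, cdf_y, rng, draws and samples, as well as the seed, are made up for the example.

```python
import numpy as np

# Expense column from the sample input file in the question.
expenses = np.array([
    12.00, 0.00, 0.00, 0.00, 0.00, 0.00, 44.50, 0.00, 346.55, 168.18,
    140.82, 10.83, 0.00, 0.00, 174.00, 0.00, 0.00, 266.53, 0.00, 110.00,
    0.00, 0.00, 44.50,
])

# PDF bin heights and edges, then the CDF and the bin center points.
hist, edges = np.histogram(expenses, bins=60, density=True)
cdf = np.cumsum(hist * np.diff(edges))
centers = (edges[:-1] + edges[1:]) / 2

# Prepend the (0, 0) point so draws below cdf[0] need no special if-branch.
cdf_x = np.concatenate(([0.0], cdf))
cdf_y = np.concatenate(([0.0], centers))

rng = np.random.default_rng(0)  # seed is arbitrary, only for reproducibility
draws = rng.random(100)         # 100 uniform random numbers in [0, 1)

# One vectorized call interpolates all draws at once (assumed stand-in for interp1d).
samples = np.interp(draws, cdf_x, cdf_y)
```

Because the comparison against Array[0] is gone, there is no array-valued if condition left, which is what triggered the "truth value of an array is ambiguous" error in the vectorized attempt.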
