Pandas

Question

I have a DataFrame of values:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
>>> print(df)
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]

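I want to transform this by replacing each value with its percentile, where the percentile is taken over the distribution of all values in prior rows. That is, if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex: I'm asking for the expanding percentile over the entire cross-sectional history.

So the goal is something like this: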

      a   b
0    99  99
..   ..  ..
499  58  84

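(Ideally I'd like to take the percentile of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)

I have a really ugly way of doing this: I transpose and unstack the DataFrame, generate a percentile mask, and then overlay that mask on the DataFrame with a for loop to get the percentiles: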

percentile_boundaries_over_time = pd.DataFrame({integer: 
                                     pd.expanding_quantile(df.T.unstack(), integer/100.0) 
                                     for integer in range(0,101,1)})

percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)

for integer in range(0,100,1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer

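I've been trying to get something faster working using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with: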

perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))

Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any help is much appreciated!

Answer 1

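Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to misbehave when given 2D data, so flattening the values to 1D first seems to help in getting correct results: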

from scipy import stats

a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Flatten the values from all columns into a single 1D array
    combined = preceding_rows.to_numpy().ravel()
    a_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'a'], 
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'b'], 
        kind='weak'
    )
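
The loop above is written for two named columns. As a minimal sketch of the same idea for any number of columns, the brute-force version below uses only NumPy and pandas. It assumes the data contains no NaN values and that a 'weak' percentile means the percentage of pooled values less than or equal to the score, which is how scipy documents kind='weak'. The function name expanding_percentile_bruteforce is made up for illustration.

```python
import numpy as np
import pandas as pd


def expanding_percentile_bruteforce(df):
    """Percentile of each value among all values in rows up to and including its own."""
    values = df.to_numpy(dtype=float)
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        # Pool every value from row 0 through row i, across all columns
        pool = values[: i + 1].ravel()
        # 'weak' percentile: share of pooled values <= the score, in percent
        out[i] = (pool[None, :] <= values[i][:, None]).mean(axis=1) * 100.0
    return pd.DataFrame(out, index=df.index, columns=df.columns)


example = pd.DataFrame({"a": [0.5, 0.1, 0.9], "b": [0.2, 0.7, 0.3]})
result = expanding_percentile_bruteforce(example)
print(result)
```

This still rescans the whole history for every row, so it is only meant as a readable reference to check faster versions against.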

Answer 2

It's not entirely clear, but do you want a cumulative sum divided by the total?

norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm

Ditto for b.

Answer 3

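As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. That is probably the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeated sorting is wasteful and computationally intensive, so we want a solution that minimizes it.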

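Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The catch is that our data set keeps expanding. Thankfully, some sorting algorithms are extremely fast on mostly sorted data (with a small number of unsorted elements added). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the new row's values to it and query their positions in the newly expanded sorted set. The latter operation is also fast because the data is sorted.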
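
To make the "insertion position" idea concrete, here is a tiny sketch with made-up numbers: the index returned by np.searchsorted, divided by the size of the sorted set, is the fraction of existing values that lie strictly below the new value.

```python
import numpy as np

# Existing data, already sorted (illustrative values)
sorted_values = np.array([0.1, 0.2, 0.5, 0.7])
new_value = 0.6

# Index at which new_value would be inserted to keep the array sorted
position = np.searchsorted(sorted_values, new_value, side="left")

# Fraction of existing values strictly below new_value
inverse_quantile = position / len(sorted_values)
print(position, inverse_quantile)
```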

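I think insertion sort would be the fastest sort for this, but a Python implementation would probably be slower than any native NumPy sort. Merge sort seems to be the best of the options available in NumPy. An ideal solution would involve writing some Cython, but using the strategy above with NumPy gets us most of the way.

Here is a hand-rolled solution: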
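
For comparison only, the insertion-based variant can be sketched in pure Python with the standard library's bisect module. This is not the approach used in this answer, and the function name expanding_quantiles_insort is made up for illustration. It assumes each row is a plain sequence of floats without NaN, and inserting into a Python list still costs linear time per element, so it is unlikely to beat NumPy on large inputs.

```python
import bisect


def expanding_quantiles_insort(rows):
    """For each row, insert its values into a running sorted list,
    then report each value's left insertion position as a fraction."""
    sorted_values = []
    result = []
    for row in rows:
        # Keep the running list sorted as each new value arrives
        for value in row:
            bisect.insort(sorted_values, value)
        n = len(sorted_values)
        # Position of each value in the expanded sorted set, scaled to [0, 1)
        result.append([bisect.bisect_left(sorted_values, v) / n for v in row])
    return result


quantiles = expanding_quantiles_insort([[0.5, 0.2], [0.1, 0.7], [0.9, 0.3]])
print(quantiles)
```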

def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """

    # Construct skeleton of DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0

    # Iterates over ndarray rows
    for i, row_array in enumerate(df.values):

        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length

        # Insert values into quantile_df
        quantile_df.iloc[i, ~row_is_nan] = quantile_row

    return quantile_df

On the data that bhalperin provided (offline), this solution was up to 10x faster.

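One final comment: np.searchsorted has 'left' and 'right' options, which determine whether the prospective insertion position is the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the solution above would take the average of 'left' and 'right':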

# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
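
As a small illustration of why the side matters, the sketch below uses made-up data with a repeated value. With duplicates, 'left' and 'right' give quite different positions, and averaging them places the value in the middle of its run of ties.

```python
import numpy as np

# Sorted data with a run of duplicates (illustrative values)
sorted_values = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
query = np.array([2.0])
n = len(sorted_values)

left = np.searchsorted(sorted_values, query, side="left")    # first suitable slot
right = np.searchsorted(sorted_values, query, side="right")  # last suitable slot

left_quantile = left / n                  # only counts values strictly below
right_quantile = right / n                # also counts the ties
mid_quantile = (left + right) / (2 * n)   # midpoint of the two
print(left_quantile, right_quantile, mid_quantile)
```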

Pandas - expanding inverse quantile function

python

scipy

percentile