Statistical calculations on a big data set produce wrong values

I am using the following code as a small example of a function I intend to apply to a big data set. I compute statistical features incrementally for each ID, where the unit of time is the month.

import numpy as np
import pandas as pd

df = pd.DataFrame([[58685991,'2020-06-01',2],
                   [58685991,'2020-06-01',1],
                   [58685991,'2020-06-01',0],
                   [58685991,'2020-12-05',7],
                   [57839709,'2020-12-01',5],
                   [57839709,'2021-01-08',3]],columns=['ID','DATE','QTD'])

def monthdelta(a,b):
    a1,a2,a3 = (int(k) for k in a.split('-'))
    b1,b2,b3 = (int(k) for k in b.split('-'))
    return (a1*12+a2) - (b1*12+b2)


startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []
total = []
ind_max = []
ind_min = []
ind_maximum = 0
ind_minimum = 0

for row in df.iterrows():  # (index, Series) pairs; df.T.iteritems() is deprecated
    id = row[1]['ID']
    if id not in startdate:
        num[id] = 1
        startdate[id] = row[1]['DATE']
        sums[id] = row[1]['QTD']
        sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
        means.append( row[1]['QTD'] )
        total.append( row[1]['QTD'] )
        stdev.append( 0 )
        ind_maximum = row[1]['QTD']
        ind_minimum = row[1]['QTD']
        ind_min.append( row[1]['QTD'] )
        ind_max.append( row[1]['QTD'] )
    else:
        num[id] += 1
        sums[id] += row[1]['QTD']
        sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
        delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
        means.append( sums[id] / delta )
        total.append( sums[id] )
        if delta == 1:
            stdev.append( 0 )
        else:
            stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))
        if row[1]['QTD'] > ind_maximum:
            ind_max.append( row[1]['QTD'] )
            ind_maximum = row[1]['QTD']
        else: 
            ind_max.append( ind_maximum )
 
        if row[1]['QTD'] < ind_minimum:
            ind_min.append( row[1]['QTD'] )
            ind_minimum = row[1]['QTD']
        else:
            ind_min.append( ind_minimum )

df['MEAN'] = pd.Series(means)
df['STDEV'] = pd.Series(stdev)
df['TOTAL'] = pd.Series(total)
df['MAX'] = pd.Series(ind_max)
df['MIN'] = pd.Series(ind_min)

The code works, and I get the following output:

    ID          DATE       QTD  MEAN        STDEV     TOTAL MAX MIN
0   58685991    2020-06-01  2   2.000000    0.000000    2   2   2
1   58685991    2020-06-01  1   3.000000    0.000000    3   2   1
2   58685991    2020-06-01  0   3.000000    0.000000    3   2   0
3   58685991    2020-12-05  7   1.428571    6.301927    10  7   0
4   57839709    2020-12-01  5   5.000000    0.000000    5   5   5
5   57839709    2021-01-08  3   4.000000    1.414214    8   5   3

The problem I am running into is that when I apply this to the big data set, some IDs end up with wrong feature values, and I can't understand why. Some have only a single QTD entry, yet their mean is above 1.0 and their total is also very high. The same problem shows up in the other features.

I'm not sure whether it is because I build a Series and only then attach it as a column on the DataFrame. Is there a way to do this by manipulating the DataFrame itself with .loc / .iloc? Would that be a safer way to handle the data? I'm not very comfortable with them, so an example would be great.
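One plausible cause, sketched below with a hypothetical gappy index: `df['MEAN'] = pd.Series(means)` aligns by index *label*, not by position. On the small example the index is a clean 0..n-1 RangeIndex, so everything lines up; but if the big DataFrame has been filtered, sampled, or concatenated, its labels no longer match the list positions, values land on the wrong rows, and the leftovers become NaN — which would produce exactly the symptom of an ID with one QTD entry picking up another row's mean or total.

```python
import pandas as pd

# Hypothetical reproduction: a DataFrame whose index has gaps,
# as happens after filtering rows out of a larger frame.
df = pd.DataFrame({'QTD': [2, 1, 7]}, index=[0, 2, 5])

means = [2.0, 1.5, 3.3]  # computed in row order, one value per row

# pd.Series(means) carries labels 0, 1, 2 -> matched by LABEL on assignment
df['MEAN_SERIES'] = pd.Series(means)

# a plain list (or NumPy array) is assigned by POSITION instead
df['MEAN_LIST'] = means

print(df)
#    QTD  MEAN_SERIES  MEAN_LIST
# 0    2          2.0        2.0
# 2    1          3.3        1.5   <- wrong value pulled in by label alignment
# 5    7          NaN        3.3   <- label 5 does not exist in the Series
```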

Here is the same logic implemented in vectorized form (usually much more efficient on large data sets):

# convert DATE to datetime
df['DATE'] = pd.to_datetime(df['DATE'])

# calculate min, max, sum
df[['min', 'max', 'sum']] = (
    df
        .groupby('ID')['QTD']
        .expanding()
        .agg(['min', 'max', 'sum'])
        .reset_index('ID', drop=True))

# calculate delta
df['date_first'] = df.groupby('ID')['DATE'].transform('min')
df['delta'] = (
    (df['DATE'].dt.year - df['date_first'].dt.year) * 12 +
    (df['DATE'].dt.month - df['date_first'].dt.month) + 1)

# calculate sum of squares
df['qtd_sq'] = df['QTD']**2
df['sum_sq'] = df.groupby('ID')['qtd_sq'].cumsum()

# calculate standard deviation
df['stdev'] = np.where(
    df['delta']==1, 0,
    np.sqrt((df['delta']*df['sum_sq'] - df['sum']*df['sum']) / df['delta']))

# calculate means
df['means'] = df['sum'] / df['delta']

# drop temp columns
df = df.drop(columns=['delta', 'qtd_sq', 'sum_sq', 'date_first'])

df

Output:

         ID       DATE  QTD  min  max   sum     stdev     means
0  58685991 2020-06-01    2  2.0  2.0   2.0  0.000000  2.000000
1  58685991 2020-06-01    1  1.0  2.0   3.0  0.000000  3.000000
2  58685991 2020-06-01    0  0.0  2.0   3.0  0.000000  3.000000
3  58685991 2020-12-05    7  0.0  7.0  10.0  6.301927  1.428571
4  57839709 2020-12-01    5  5.0  5.0   5.0  0.000000  5.000000
5  57839709 2021-01-08    3  3.0  5.0   8.0  1.414214  4.000000
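As for the .loc / .iloc part of the question — a minimal sketch (with a hypothetical running-mean column) of writing results into the frame during iteration. `.iat` and `.iloc` address rows by position rather than by label, so they are immune to the alignment pitfall of assigning a freshly built `pd.Series`; assigning a plain list or array to a column is positional as well:

```python
import pandas as pd

# Hypothetical frame with a non-default index, as after filtering.
df = pd.DataFrame({'QTD': [2, 1, 7]}, index=[10, 20, 30])

df['MEAN'] = 0.0                  # pre-allocate the column
col = df.columns.get_loc('MEAN')  # positional column offset for .iat

running = 0.0
for pos, qtd in enumerate(df['QTD']):
    running += qtd
    df.iat[pos, col] = running / (pos + 1)  # write by position, not label

# equivalently, collect values in a list and assign it at the end:
# df['MEAN'] = values_list   # plain lists assign by position
```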