Merging median from one column by a key column - SFrame / Pandas

In graphlab, I have the following SFrame called train:

import graphlab
train = graphlab.read_csv('clean_train.csv')
train.head()

[out]:

+-------+------------+---------+-----------+
| Store |    Date    |  Sales  | Customers |
+-------+------------+---------+-----------+
|   1   | 2015-07-31 |  5263.0 |   555.0   |
|   2   | 2015-07-31 |  6064.0 |   625.0   |
|   3   | 2015-07-31 |  8314.0 |   821.0   |
|   4   | 2015-07-31 | 13995.0 |   1498.0  |
|   3   | 2015-07-20 |  4822.0 |   559.0   |
|   2   | 2015-07-10 |  5651.0 |   589.0   |
|   4   | 2015-07-11 | 15344.0 |   1414.0  |
|   5   | 2015-07-23 |  8492.0 |   833.0   |
|   2   | 2015-07-19 |  8565.0 |   687.0   |
|   10  | 2015-07-09 |  7185.0 |   681.0   |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]

To get the median sales for each store, I can do the following in graphlab, which appends a new column holding each store's median sales:

import graphlab.aggregate as agg
mediansales_perstore = train.groupby('Store', operations={'mediansales': agg.QUANTILE('Sales', 0.5)})
train = train.join(mediansales_perstore, on='Store')
# Each QUANTILE result is a sequence; keep its first element as the median.
train['mediansales'] = [i[0] for i in train['mediansales']]

This code works in graphlab and adds a new column, mediansales. But when I try the same thing with a pandas DataFrame:

mediansales_perstore = train.groupby(['Store'])['Sales'].median()

This extracts the median sales for each store, just as the graphlab code does, but when I try to join it back onto the train DataFrame:

train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

it fails and throws this error:

ValueError                                Traceback (most recent call last)
<ipython-input-15-7b64cb46e386> in <module>()
----> 1 train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4017         # For SparseDataFrame's benefit
   4018         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4019                                  rsuffix=rsuffix, sort=sort)
   4020 
   4021     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4031             return merge(self, other, left_on=on, how=how,
   4032                          left_index=on is None, right_index=True,
-> 4033                          suffixes=(lsuffix, rsuffix), sort=sort)
   4034         else:
   4035             if on is not None:

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
    190 
    191         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 192                                                      rdata.items, rsuf)
    193 
    194         lindexers = {1: left_indexer} if left_indexer is not None else {}

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   3969         if not lsuffix and not rsuffix:
   3970             raise ValueError('columns overlap but no suffix specified: %s' %
-> 3971                              to_rename)
   3972 
   3973         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index([u'Sales'], dtype='object')

How can I merge the median of the "Sales" column back into the frame with pandas, using "Store" as the key? The graphlab code works, though.

You can do this in a single step using transform:

>>> train['Median-Sales'] = train.groupby('Store')['Sales'].transform('median')
>>> train
   Store        Date  Sales  Customers  Median-Sales
0      1  2015-07-31   5263        555        5263.0
1      2  2015-07-31   6064        625        6064.0
2      3  2015-07-31   8314        821        6568.0
3      4  2015-07-31  13995       1498       14669.5
4      3  2015-07-20   4822        559        6568.0
5      2  2015-07-10   5651        589        6064.0
6      4  2015-07-11  15344       1414       14669.5
7      5  2015-07-23   8492        833        8492.0
8      2  2015-07-19   8565        687        6064.0
9     10  2015-07-09   7185        681        7185.0
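
To see why transform can be assigned straight back, it may help to compare it with the plain aggregation. The sketch below uses a small hypothetical frame built from the ten rows shown in the question (only the Store and Sales columns); it is an illustration, not the full dataset.

```python
import pandas as pd

# Hypothetical sample: the ten rows printed in the question.
train = pd.DataFrame({
    'Store': [1, 2, 3, 4, 3, 2, 4, 5, 2, 10],
    'Sales': [5263.0, 6064.0, 8314.0, 13995.0, 4822.0,
              5651.0, 15344.0, 8492.0, 8565.0, 7185.0],
})

# Aggregation: one value per store, indexed by Store.
per_store = train.groupby('Store')['Sales'].median()

# transform: one value per original row, carrying train's own index,
# so it lines up with train without any join.
per_row = train.groupby('Store')['Sales'].transform('median')

train['Median-Sales'] = per_row
print(train)
```

Here per_store has one entry per distinct store, while per_row has the same length and index as train, which is what makes the direct column assignment safe.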

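The join error just says that the left and right frames share a column name, so you need either to supply a suffix to tell the columns apart or to rename the column: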

>>> right = train.groupby('Store')['Sales'].median()
>>> right.name = 'Median-Sales'
>>> train.join(right, on='Store')
   Store        Date  Sales  Customers  Median-Sales
0      1  2015-07-31   5263        555        5263.0
1      2  2015-07-31   6064        625        6064.0
2      3  2015-07-31   8314        821        6568.0
3      4  2015-07-31  13995       1498       14669.5
4      3  2015-07-20   4822        559        6568.0
5      2  2015-07-10   5651        589        6064.0
6      4  2015-07-11  15344       1414       14669.5
7      5  2015-07-23   8492        833        8492.0
8      2  2015-07-19   8565        687        6064.0
9     10  2015-07-09   7185        681        7185.0
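
The other route mentioned above, supplying a suffix, can be sketched as follows, along with an equivalent merge-based version. The sample frame and the suffix `_median` are assumptions chosen for illustration; any non-empty suffix would do.

```python
import pandas as pd

# Hypothetical sample: the ten rows printed in the question.
train = pd.DataFrame({
    'Store': [1, 2, 3, 4, 3, 2, 4, 5, 2, 10],
    'Sales': [5263.0, 6064.0, 8314.0, 13995.0, 4822.0,
              5651.0, 15344.0, 8492.0, 8565.0, 7185.0],
})

medians = train.groupby('Store')['Sales'].median()
# The aggregated Series keeps the name 'Sales', which is the overlap
# that the ValueError complains about.
print(medians.name)

# Option 1: leave the name alone and let join disambiguate with a suffix.
with_suffix = train.join(medians.to_frame(), on='Store', rsuffix='_median')

# Option 2: rename the Series, turn the Store index into a column,
# and use a left merge on that column.
right = medians.rename('Median-Sales').reset_index()
with_merge = train.merge(right, on='Store', how='left')

print(with_suffix)
print(with_merge)
```

Both options give the same per-row medians as the transform approach; renaming up front tends to produce a more descriptive column name than a suffix does.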