koalas groupby -> apply returns 'cannot insert "key", already exists'

I have been struggling with this problem for a while and have not been able to solve it. I have the following DataFrame:

import databricks.koalas as ks
from pandas import Timestamp

x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
  1: Timestamp('2017-06-08 00:00:00'),
  2: Timestamp('2018-10-22 00:00:00'),
  3: Timestamp('2017-02-08 00:00:00'),
  4: Timestamp('2019-02-03 00:00:00'),
  5: Timestamp('2019-02-26 00:00:00'),
  6: Timestamp('2017-04-15 00:00:00'),
  7: Timestamp('2017-07-02 00:00:00'),
  8: Timestamp('2017-04-04 00:00:00'),
  9: Timestamp('2017-03-20 00:00:00'),
  10: Timestamp('2018-06-09 00:00:00'),
  11: Timestamp('2017-01-15 00:00:00'),
  12: Timestamp('2018-05-07 00:00:00'),
  13: Timestamp('2018-01-17 00:00:00'),
  14: Timestamp('2017-07-11 00:00:00'),
  15: Timestamp('2018-12-17 00:00:00'),
  16: Timestamp('2018-12-05 00:00:00'),
  17: Timestamp('2017-05-22 00:00:00'),
  18: Timestamp('2017-08-13 00:00:00'),
  19: Timestamp('2018-05-21 00:00:00')},
 'store': {0: 81,
  1: 128,
  2: 81,
  3: 128,
  4: 25,
  5: 128,
  6: 11,
  7: 124,
  8: 43,
  9: 25,
  10: 25,
  11: 124,
  12: 124,
  13: 128,
  14: 81,
  15: 11,
  16: 124,
  17: 11,
  18: 167,
  19: 128},
 'stock': {0: 1,
  1: 236,
  2: 3,
  3: 9,
  4: 36,
  5: 78,
  6: 146,
  7: 20,
  8: 12,
  9: 12,
  10: 15,
  11: 25,
  12: 10,
  13: 7,
  14: 0,
  15: 230,
  16: 80,
  17: 6,
  18: 110,
  19: 8},
 'sells': {0: 1.0,
  1: 17.0,
  2: 1.0,
  3: 2.0,
  4: 1.0,
  5: 2.0,
  6: 7.0,
  7: 1.0,
  8: 1.0,
  9: 1.0,
  10: 2.0,
  11: 1.0,
  12: 1.0,
  13: 1.0,
  14: 1.0,
  15: 1.0,
  16: 1.0,
  17: 3.0,
  18: 2.0,
  19: 1.0}}
)

And this is the function I want to use in a groupby-apply:

import numpy as np

def compute_indicator(df):
  return (
    df.copy()
    .assign(
      indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
    )
    .astype(int)
    .fillna(1)
  )

where df is a pandas DataFrame. If I do the groupby-apply with pandas, the code executes as expected:

import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)

But when I try to run the same operation on Koalas, I get the following error: ValueError: cannot insert store, already exists

x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists

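I cannot use type annotations in compute_indicator because some columns are not fixed (they travel along with the DataFrame and are meant to be used by other transformations).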

What should I do to make this code run in Koalas?

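As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) runs an untyped function f for the first time, it has to infer the schema, and to do so it runs pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that the pandas apply receives a DataFrame as its argument in which the groupby keys appear both as the index and as columns (see this issue).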
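To make the "keys as both index and column" shape concrete, here is a minimal pandas-only sketch. It does not call Koalas at all: the toy frame and the use of set_index(..., drop=False) are assumptions made purely to reproduce that shape directly, and whether Koalas reaches the error through exactly this call internally is also an assumption. It only shows that plain pandas produces the same message once an index level and a column share a name.

```python
import pandas as pd

# Toy data (hypothetical, much smaller than the frame in the question).
pdf = pd.DataFrame({"store": [81, 81, 128], "stock": [1, 3, 236]})

# Reproduce the problematic shape: "store" is the index name
# and is still present as a regular column.
both = pdf.set_index("store", drop=False)

try:
    # Moving the index back into the columns collides with the existing column.
    both.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Any step that moves the index back into the columns of such a frame fails in this way.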

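The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f does not drop the keys column, this conversion raises an exception because of the duplicated column names (see issue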
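Following the remark that f has to drop the keys column, one possible shape for such a function is sketched below, again in plain pandas so it can be run without a Spark session. The function name, the toy data and the stock_doubled column are hypothetical, and this has not been verified against Koalas 0.29.0; it only illustrates that once the key column is removed inside f, the result can be flattened without a name clash.

```python
import pandas as pd

# Toy data (hypothetical).
pdf = pd.DataFrame(
    {"store": [81, 81, 128, 128], "stock": [1, 3, 236, 9], "sells": [1.0, 1.0, 17.0, 2.0]}
)

def transform_without_key(group):
    # Drop the grouping column if it is present, so the result cannot
    # end up with "store" both in the index and in the columns.
    # errors="ignore" keeps this working when the key is not passed in.
    out = group.drop(columns=["store"], errors="ignore")
    return out.assign(stock_doubled=out["stock"] * 2)

result = pdf.groupby("store", group_keys=True).apply(transform_without_key)

# Moving the index back into the columns no longer collides with anything.
flat = result.reset_index()
print(flat.columns.tolist())
```

The key is still recoverable from the index of the result, so no information is lost by dropping it inside the function.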