Balancing panel data for regression

I have a DataFrame:

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'], "year": [2008, 2009, 2010, 2008, 2010, 2009], "value": [10, 20, 30, 10, 20, 30]})

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2010    20
5   3   ghk     2009    30

I want to create a balanced panel, with one row per id for every year:

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2009    NaN
5   2   def10   2010    20
6   3   ghk     2008    NaN
7   3   ghk     2009    30
8   3   ghk     2010    NaN

I tried the following code:

df = df.set_index('id')
balanced = (df.set_index('year', append=True).reindex(pd.MultiIndex.from_product([df.index, range(df.year.min(), df.year.max() + 1)], names=['frs_id', 'year'])).reset_index(level=1))

This gives me the following error:

cannot handle a non-unique multi-index!

Pivot the table, then stack year back into the index without dropping the NaN values:

>>> df.pivot(index=["id", "city"], columns="year", values="value") \
      .stack(dropna=False) \
      .rename("value") \
      .reset_index()

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   30.0
8   3    ghk  2010    NaN
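To see why this works, it can help to look at the intermediate wide table before stacking. The following is only a minimal sketch on the sample data from the question; the variable name `wide` is my own. Pivoting produces one cell per (id, city) and year combination, and the combinations that were absent from the input show up as NaN, which is exactly what the stacked result keeps.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "city": ["abc", "abc", "abc", "def10", "def10", "ghk"],
    "year": [2008, 2009, 2010, 2008, 2010, 2009],
    "value": [10, 20, 30, 10, 20, 30],
})

# One row per (id, city), one column per year; missing pairs become NaN.
wide = df.pivot(index=["id", "city"], columns="year", values="value")
print(wide)

# 3 units x 3 years = 9 cells, of which 3 had no observation in the input.
n_units, n_years = wide.shape
n_missing = int(wide.isna().sum().sum())
print(n_units, n_years, n_missing)
```

Keeping those NaN cells when stacking is what turns the 6 input rows into the 9 rows of the balanced panel.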

Edit: the case of duplicate entries

I slightly modified your original DataFrame:

df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
   id   city  year  value
0   1    abc  2008     10
1   1    abc  2009     20
2   1    abc  2010     30
3   2  def10  2008     10
4   2  def10  2010     20
5   3    ghk  2009     30  # The problem is here
6   3    ghk  2009     40  # same (id, city, year)

You need to make a decision: do you want to keep row 5 or row 6, or apply an aggregate function (mean, sum, ...)? Suppose you want the mean for (3, ghk, 2009):

>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
      .stack(dropna=False) \
      .rename("value") \
      .reset_index()

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   35.0  # <- mean of (30, 40)
8   3    ghk  2010    NaN
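If you would rather keep one of the two rows than aggregate them, a sketch along these lines may help. It assumes the modified DataFrame above; the names `keys`, `dupes`, `keep_first` and `keep_last` are mine, chosen for illustration. It first lists the conflicting rows and then keeps either the first or the last of each duplicated (id, city, year) group.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3],
    "city": ["abc", "abc", "abc", "def10", "def10", "ghk", "ghk"],
    "year": [2008, 2009, 2010, 2008, 2010, 2009, 2009],
    "value": [10, 20, 30, 10, 20, 30, 40],
})

keys = ["id", "city", "year"]

# Show every row that shares its key with another row.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# Keep row 5 (the first occurrence) or row 6 (the last occurrence).
keep_first = df.drop_duplicates(subset=keys, keep="first")
keep_last = df.drop_duplicates(subset=keys, keep="last")
print(keep_first)
print(keep_last)
```

Once the keys are unique, the plain `pivot` version from the first part of the answer applies again.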

You are close to a solution. You can modify your code slightly as follows:

idx = pd.MultiIndex.from_product([df['id'].unique(),range(df.year.min(),df.year.max()+1)],names=['id','year'])

df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()

df2['city'] = df2.groupby('id')['city'].ffill().bfill()

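Code changes:

  1. Create the MultiIndex from the unique values of id instead of from the index
  2. Set the index on both id and year before calling reindex()
  3. Fill the NaN values in the city column with the non-NaN entries of the same id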
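The first change is what removes the error. As a rough illustration on the sample data (the names `years`, `from_index` and `from_unique` are mine), building the product from the id index repeats each id once per original row, so the target index contains duplicate (id, year) pairs and reindex() cannot use it:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "city": ["abc", "abc", "abc", "def10", "def10", "ghk"],
    "year": [2008, 2009, 2010, 2008, 2010, 2009],
    "value": [10, 20, 30, 10, 20, 30],
})

years = range(df["year"].min(), df["year"].max() + 1)

# As in the question: the id index holds one entry per original row.
from_index = pd.MultiIndex.from_product(
    [df.set_index("id").index, years], names=["id", "year"]
)

# As in this answer: each id appears exactly once.
from_unique = pd.MultiIndex.from_product(
    [df["id"].unique(), years], names=["id", "year"]
)

print(len(from_index), from_index.is_unique)
print(len(from_unique), from_unique.is_unique)
```

Only the second index has exactly one entry per (id, year) pair, which is the shape a balanced panel needs.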

Result:

print(df2)

   id  year   city  value
0   1  2008    abc   10.0
1   1  2009    abc   20.0
2   1  2010    abc   30.0
3   2  2008  def10   10.0
4   2  2009  def10    NaN
5   2  2010  def10   20.0
6   3  2008    ghk    NaN
7   3  2009    ghk   30.0
8   3  2010    ghk    NaN

Optionally, you can rearrange the column order:

df2.insert(2, 'year', df2.pop('year'))
print(df2)

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   30.0
8   3    ghk  2010    NaN

Edit

Instead of reindex(), you can also use stack() and unstack(), as follows:

(df.set_index(['id', 'city', 'year'], append=True)
   .unstack()
   .groupby(level=[1, 2]).max()
   .stack(dropna=False)
).reset_index()

Output:

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   30.0
8   3    ghk  2010    NaN
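Whichever variant you use, it may be worth checking that the result really is balanced before running the regression. The helper below is a hypothetical sketch of my own, not a pandas function; `is_balanced` and its `unit` and `time` parameters are names I made up. It treats a panel as balanced when no (id, year) pair is repeated and every id covers every year present in the frame.

```python
import pandas as pd


def is_balanced(panel, unit="id", time="year"):
    """Return True if each unit has exactly one row for every time period."""
    no_repeats = not panel.duplicated(subset=[unit, time]).any()
    periods_per_unit = panel.groupby(unit)[time].nunique()
    full_coverage = (periods_per_unit == panel[time].nunique()).all()
    return bool(no_repeats and full_coverage)


df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "city": ["abc", "abc", "abc", "def10", "def10", "ghk"],
    "year": [2008, 2009, 2010, 2008, 2010, 2009],
    "value": [10, 20, 30, 10, 20, 30],
})

# Balance the sample data with the reindex approach shown earlier.
idx = pd.MultiIndex.from_product(
    [df["id"].unique(), range(df["year"].min(), df["year"].max() + 1)],
    names=["id", "year"],
)
balanced = df.set_index(["id", "year"]).reindex(idx).reset_index()

print(is_balanced(df))
print(is_balanced(balanced))
```

Note that this check only compares against the years that occur in the frame, so a year missing for every id would go unnoticed.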