是否有 Pandas 函数等同于 Stata 填充？

Question

Stata 的 fillin 命令使数据集成为矩形。我怎样才能在 Pandas 中做同样的事情？

我试过用这种方式模拟fillin命令，但是很慢而且很贵：

from itertools import product

collapse_df = collapse_df[['cod', 'loc_id', 'fob']]
var = list(product(collapse_df['loc_id'], collapse_df['cod']))
var = list(set([i for i in var]))
var_df = collapse_df[0:0]
for idx,item in enumerate(var):
    df_t = collapse_df[(collapse_df['loc_id'] == item[0]) & (collapse_df['cod'] == item[1])]
    if (len(df_t) == 0):
        df_t.loc[0, 'loc_id'] = item[0]
        df_t.loc[0, 'cod'] = item[1]
    var_df = pd.concat([var_df, df_t], axis=0)
collapse_df = var_df.drop_duplicates()

编辑：

输入： https://drive.google.com/file/d/1giWlKXNFaXeLpaVSUDc04AwyogD-ASJK/view?usp=sharing

输出： https://drive.google.com/file/d/1UcADbQnbDELGPHZVIt2BmCYYTzU5pZtf/view?usp=sharing

资源：https://www.stata.com/manuals13/dfillin.pdf

Answer 1

我不能 100% 确定您提供的输入是您想要的结果。但是，我尝试解决了这个与 Stata 文档所述不符的问题。

设置测试数据（提问时很重要，以后请提供）

import pandas as pd 
import numpy as np 
import itertools

np.random.seed(42)
test_data = pd.DataFrame(
    {
        'AgeGroup': np.random.choice(['20-24', '18-19', '10-17'], size=10, p=[0.75, 0.20, 0.05]),
        'Sex': np.random.choice(['male', 'female'], size=10),
        'Race': np.random.choice(['black', 'white'], size=10, p=[0.3, 0.7]),
        'x1': np.random.uniform(size=10),
        'x2': np.random.normal(0, 1, size=10)
    }
)

test_data

输出：

  AgeGroup     Sex   Race        x1        x2
0    20-24  female  black  0.785176 -0.600254
1    10-17    male  white  0.199674  0.947440
2    20-24  female  white  0.514234  0.291034
3    20-24  female  white  0.592415 -0.635560
4    20-24  female  black  0.046450 -1.021552
5    20-24  female  white  0.607545 -0.161755
6    20-24  female  black  0.170524 -0.533649
7    18-19  female  black  0.065052 -0.005528
8    20-24  female  white  0.948886 -0.229450
9    20-24  female  white  0.965632  0.389349

我的做法基本上是：

找到标识列的所有可能组合
查找提供的数据集中不存在的组合
创建一个包含缺失组合的空数据集并将其连接到现有数据集

def fill_in(df, id_cols):
    """Fill in empty records for combinations of id_cols that do not exist
    in dataset.
    
    Args:
        df: dataset
        id_cols: list of identity columns

    Returns:
        filled_df: dataframe with empty records for missing combinations of id_cols
    """
    # create all possible unique combinations of id_cols
    # and find combos that do not exist in the dataset
    id_combos = list(itertools.product(*[df[c].unique() for c in id_cols]))
    existing_combos = df[id_cols].apply(tuple, axis=1).unique()
    missing_combos = set(id_combos) - set(existing_combos)

    # create an empty dataframe with the missing combos
    other_cols = [c for c in df.columns if c not in id_cols]
    new_idx = pd.MultiIndex.from_tuples(missing_combos, names=id_cols)
    empty_data = np.empty(shape=(len(missing_combos), len(other_cols))).fill(np.nan)
    filled_df = pd.DataFrame(data=empty_data, index=new_idx, columns=other_cols).reset_index()

    # concat dataset with empty dataset for missing combos
    return pd.concat([df.assign(_fill_in=0), filled_df.assign(_fill_in=1)])

尝试一下：

fill_df(test_data, ['AgeGroup', 'Sex', 'Race'])

结果：

  AgeGroup     Sex   Race        x1        x2  _fill_in
0    20-24  female  black  0.785176 -0.600254         0
1    10-17    male  white  0.199674  0.947440         0
2    20-24  female  white  0.514234  0.291034         0
3    20-24  female  white  0.592415 -0.635560         0
4    20-24  female  black  0.046450 -1.021552         0
5    20-24  female  white  0.607545 -0.161755         0
6    20-24  female  black  0.170524 -0.533649         0
7    18-19  female  black  0.065052 -0.005528         0
8    20-24  female  white  0.948886 -0.229450         0
9    20-24  female  white  0.965632  0.389349         0
0    10-17  female  white       NaN       NaN         1
1    10-17  female  black       NaN       NaN         1
2    18-19    male  black       NaN       NaN         1
3    10-17    male  black       NaN       NaN         1
4    18-19    male  white       NaN       NaN         1
5    18-19  female  white       NaN       NaN         1
6    20-24    male  white       NaN       NaN         1
7    20-24    male  black       NaN       NaN         1

是否有 Pandas 函数等同于 Stata 填充？

Is there a Pandas function equivalent to Stata fillin?

python

dataframe

stata

python-3.x

pandas