Is there a Pandas function equivalent to Stata fillin?
Stata's fillin command makes a dataset rectangular. How can I do the same thing in Pandas?
I tried emulating the fillin command this way, but it is slow and expensive:
import pandas as pd
from itertools import product

collapse_df = collapse_df[['cod', 'loc_id', 'fob']]
var = list(product(collapse_df['loc_id'], collapse_df['cod']))
var = list(set([i for i in var]))
var_df = collapse_df[0:0]
for idx,item in enumerate(var):
    df_t = collapse_df[(collapse_df['loc_id'] == item[0]) & (collapse_df['cod'] == item[1])]
    if (len(df_t) == 0):
        df_t.loc[0, 'loc_id'] = item[0]
        df_t.loc[0, 'cod'] = item[1]
    var_df = pd.concat([var_df, df_t], axis=0)
collapse_df = var_df.drop_duplicates()
Edit:
Input: https://drive.google.com/file/d/1giWlKXNFaXeLpaVSUDc04AwyogD-ASJK/view?usp=sharing
Output: https://drive.google.com/file/d/1UcADbQnbDELGPHZVIt2BmCYYTzU5pZtf/view?usp=sharing
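I'm not 100% sure that the input you provided matches the result you want. However, I attempted to solve this following what the Stata documentation describes.
Set up test data (this is important when asking a question; please provide it in the future):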
import pandas as pd
import numpy as np
import itertools

np.random.seed(42)
test_data = pd.DataFrame(
    {
        'AgeGroup': np.random.choice(['20-24', '18-19', '10-17'], size=10, p=[0.75, 0.20, 0.05]),
        'Sex': np.random.choice(['male', 'female'], size=10),
        'Race': np.random.choice(['black', 'white'], size=10, p=[0.3, 0.7]),
        'x1': np.random.uniform(size=10),
        'x2': np.random.normal(0, 1, size=10)
    }
)
test_data
Output:
   AgeGroup     Sex   Race        x1        x2
0     20-24  female  black  0.785176 -0.600254
1     10-17    male  white  0.199674  0.947440
2     20-24  female  white  0.514234  0.291034
3     20-24  female  white  0.592415 -0.635560
4     20-24  female  black  0.046450 -1.021552
5     20-24  female  white  0.607545 -0.161755
6     20-24  female  black  0.170524 -0.533649
7     18-19  female  black  0.065052 -0.005528
8     20-24  female  white  0.948886 -0.229450
9     20-24  female  white  0.965632  0.389349
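My approach is basically:
- Find all possible combinations of the identity columns
- Find the combinations that do not exist in the provided dataset
- Create an empty dataset containing the missing combinations and concatenate it to the existing dataset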
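To make the first two steps concrete, here is a minimal sketch on a toy frame. The frame `toy` and its columns `a`, `b`, and `v` are made up for illustration and are not part of the question's data.

```python
import itertools
import pandas as pd

# Hypothetical toy data: two id columns with exactly one missing combination
toy = pd.DataFrame({"a": ["x", "x", "y"], "b": ["p", "q", "p"], "v": [0.1, 0.2, 0.3]})
id_cols = ["a", "b"]

# Step 1: every combination of the distinct values in each id column
all_combos = set(itertools.product(*[toy[c].unique() for c in id_cols]))

# Step 2: combinations already present in the data, then the set difference
existing = set(toy[id_cols].itertuples(index=False, name=None))
missing = all_combos - existing
print(missing)  # {('y', 'q')}
```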
def fill_in(df, id_cols):
    """Fill in empty records for combinations of id_cols that do not exist
    in dataset.

    Args:
        df: dataset
        id_cols: list of identity columns

    Returns:
        filled_df: dataframe with empty records for missing combinations of id_cols
    """
    # create all possible unique combinations of id_cols
    # and find combos that do not exist in the dataset
    id_combos = list(itertools.product(*[df[c].unique() for c in id_cols]))
    existing_combos = df[id_cols].apply(tuple, axis=1).unique()
    # set difference, so the order of the filled-in rows is not guaranteed
    missing_combos = list(set(id_combos) - set(existing_combos))

    # create an empty dataframe with the missing combos
    other_cols = [c for c in df.columns if c not in id_cols]
    new_idx = pd.MultiIndex.from_tuples(missing_combos, names=id_cols)
    # np.full returns a NaN-filled array; ndarray.fill() works in place and returns None
    empty_data = np.full((len(missing_combos), len(other_cols)), np.nan)
    filled_df = pd.DataFrame(data=empty_data, index=new_idx, columns=other_cols).reset_index()

    # concat dataset with empty dataset for missing combos
    return pd.concat([df.assign(_fill_in=0), filled_df.assign(_fill_in=1)])
Try it out:
fill_in(test_data, ['AgeGroup', 'Sex', 'Race'])
Result:
   AgeGroup     Sex   Race        x1        x2  _fill_in
0     20-24  female  black  0.785176 -0.600254         0
1     10-17    male  white  0.199674  0.947440         0
2     20-24  female  white  0.514234  0.291034         0
3     20-24  female  white  0.592415 -0.635560         0
4     20-24  female  black  0.046450 -1.021552         0
5     20-24  female  white  0.607545 -0.161755         0
6     20-24  female  black  0.170524 -0.533649         0
7     18-19  female  black  0.065052 -0.005528         0
8     20-24  female  white  0.948886 -0.229450         0
9     20-24  female  white  0.965632  0.389349         0
0     10-17  female  white       NaN       NaN         1
1     10-17  female  black       NaN       NaN         1
2     18-19    male  black       NaN       NaN         1
3     10-17    male  black       NaN       NaN         1
4     18-19    male  white       NaN       NaN         1
5     18-19  female  white       NaN       NaN         1
6     20-24    male  white       NaN       NaN         1
7     20-24    male  black       NaN       NaN         1
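The same rectangularization can also be sketched without building tuples by hand: construct the full grid of id combinations with `pd.MultiIndex.from_product` and right-join the data onto it, letting the `indicator` option of `merge` mark which rows were filled in. The function name `fill_in_merge` and the `demo` frame below are hypothetical, and this is an alternative sketch rather than the approach used above.

```python
import pandas as pd

def fill_in_merge(df, id_cols):
    """Hypothetical alternative: rectangularize df over id_cols using a merge."""
    # Full grid: Cartesian product of the distinct values of each id column
    full = pd.MultiIndex.from_product(
        [df[c].unique() for c in id_cols], names=id_cols
    ).to_frame(index=False)
    # Right join keeps every combination; indicator records each row's origin
    out = df.merge(full, on=id_cols, how="right", indicator=True)
    # 1 for combinations found only in the grid (filled in), 0 for original rows
    out["_fill_in"] = (out["_merge"] == "right_only").astype(int)
    return out.drop(columns="_merge")

# Made-up demo data: the combination (g="b", h="v") is absent
demo = pd.DataFrame({"g": ["a", "a", "b"], "h": ["u", "v", "u"], "x": [1.0, 2.0, 3.0]})
result = fill_in_merge(demo, ["g", "h"])
print(result)
```

Unlike the concat version, the rows come back in the order of the full grid rather than with the filled-in records appended at the end, so sort afterwards if the original order matters.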