Creating a new column that can equal one of many columns depending on condition (Pandas)

Question

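I have a wide pandas DataFrame with many variables named in the form 'crimeYR'. For example, crime1996 is a dummy variable that tells me whether an observation had been convicted up to 1996, crime1998 tells me whether he/she had been convicted up to 1998, and so on. The people in the dataset were born in different years, and I would like to create a variable that tells me whether a person committed a crime by the age of 25. Here is an example of what I would like to create: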

birthYR  crime2006 crime2008 crime2010 crimeby25
1981         0         1         1         0
1981         1         1         1         1
1983         0         1         1         1
1982         0         0         1         0

I have a rough idea of how to code this in Stata, but I am struggling to get it to work in Python. Here is the idea of how it would work in Stata:

gen crimeby25 = 0
foreach v of num 2006/2016{
     replace crimeby25 = crime`v' if `v' - birthyr == 25
}

What is a simple way to do what I am trying to do in Python?

Answer 1

Here is a solution. You have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'birthYR': [1981, 1981, 1983, 1982],
                   'crime2006': [0, 1, 0, 0],
                   'crime2008': [1, 1, 1, 0],
                   'crime2010': [1, 1, 1, 1]})

df

   birthYR  crime2006  crime2008  crime2010
0     1981          0          1          1
1     1981          1          1          1
2     1983          0          1          1
3     1982          0          0          1

Let's first define the list of years we are looking at:

years = [2006,2008,2010]
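With many year columns, typing the list by hand gets tedious. As a minimal sketch, assuming every relevant column is named 'crime' followed only by digits, the list can also be read off the column names. The `prefix` variable is just an illustrative name, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'birthYR': [1981, 1981, 1983, 1982],
                   'crime2006': [0, 1, 0, 0],
                   'crime2008': [1, 1, 1, 0],
                   'crime2010': [1, 1, 1, 1]})

# Assumption: year columns look like "crime" + digits (e.g. "crime2006").
prefix = "crime"
years = sorted(
    int(col[len(prefix):])
    for col in df.columns
    if col.startswith(prefix) and col[len(prefix):].isdigit()
)

print(years)  # [2006, 2008, 2010]
```

The `isdigit()` check skips columns such as crimeby25, so the snippet can be rerun after the result column has been added.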

We create some useful intermediate columns:

for year in years:
    # Compute the age in the given year
    df["AgeIn"+str(year)] = year - df["birthYR"]

    # Was he/she 25 or younger in the given year?
    df["NotMoreThan25In"+str(year)] = df["AgeIn"+str(year)] <= 25

    # Remove the age column for clarity
    df = df.drop("AgeIn"+str(year), axis=1)

    # Check whether he/she had committed a crime and was 25 or younger in the given year
    df["NotMoreThan25In"+str(year)+"AndCrime"] = df["NotMoreThan25In"+str(year)]*df["crime"+str(year)]

Finally, we take the maximum over the years to see whether he/she committed a crime by the age of 25:

df["crimeby25"] = df[["NotMoreThan25In"+str(year)+"AndCrime" for year in years]].max(axis=1)

df["crimeby25"]

0    0
1    1
2    1
3    0
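
If the helper columns are not needed afterwards, the same "25 or younger and crime dummy set" rule can probably be applied in one step. The following is only a sketch of that idea using NumPy broadcasting; it is a different technique from the loop above, and the names `age` and `crime` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'birthYR': [1981, 1981, 1983, 1982],
                   'crime2006': [0, 1, 0, 0],
                   'crime2008': [1, 1, 1, 0],
                   'crime2010': [1, 1, 1, 1]})
years = [2006, 2008, 2010]

# Age of every person in every year: shape (n_rows, n_years)
age = np.array(years) - df["birthYR"].to_numpy()[:, None]

# Crime dummies, with columns in the same order as `years`
crime = df[[f"crime{year}" for year in years]].to_numpy()

# 1 if, in any year, the person was 25 or younger and the dummy was set
df["crimeby25"] = ((age <= 25) & (crime == 1)).any(axis=1).astype(int)

print(df["crimeby25"].tolist())  # [0, 1, 1, 0]
```

This leaves the DataFrame with only the original columns plus crimeby25. As with the loop, a person who turns 25 in a year with no crime column (here, someone born in 1982) is judged on the last available year before that.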

python

calculated-columns

pandas