Add rows based on the numerical value of a column

I may be going about this the wrong way, but given how I plan to analyze my data, I need one entry for every application.

My data frame looks like this:

ID   Job Title  Number Applied  Hired  Feature(Math)
 1  Accountant               3      2              1
 2   Marketing               1      1              0
 3     Finance               1      1              1

I need it to look like this (1 = yes, 0 = no):

ID   Job Title  Number Applied  Hired  Feature(Math)       
 1  Accountant               1      0              1
 2  Accountant               1      1              1
 3  Accountant               1      1              1
 4   Marketing               1      1              0
 5     Finance               1      1              1

I need to add one row per applicant. Number Applied should always be 1. Once that is done, the Number Applied column can be dropped.

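There are additional features that I have not included here. The point of the analysis is to apply a machine learning algorithm to predict whether a person will get a job based on their skills. My current data frame does not work because, when I convert Hired to yes or no, it counts only 2 hired people with math skills instead of 3.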
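
To make the miscount concrete, here is a small sketch that rebuilds the example frame from the question and compares the yes/no count with the true count. The variable names has_math, naive and actual are only illustrative.

```python
import pandas as pd

# Rebuild the aggregated example frame from the question.
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Job Title': ['Accountant', 'Marketing', 'Finance'],
    'Number Applied': [3, 1, 1],
    'Hired': [2, 1, 1],
    'Feature(Math)': [1, 0, 1],
})

has_math = df['Feature(Math)'] == 1

# Treating Hired as yes/no counts each aggregated row only once.
naive = int(((df['Hired'] > 0) & has_math).sum())

# Summing the Hired counts gives the real number of hired people with math.
actual = int(df.loc[has_math, 'Hired'].sum())

print(naive, actual)
```

The Accountant row stands for two hired people, which is why the row-level count falls short.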

Here is an approach I have used before to "unroll" a set of aggregated samples.

import pandas as pd

def iterdicts(df):
    """
    Utility to iterate over rows of a data frame as dictionaries.
    """
    col = df.columns
    for row in df.itertuples(name=None, index=False):
        yield dict(zip(col, row))

def deaggregate(dicts, *columns):
    """
    Deaggregate an iterable of dictionaries `dicts` where the numbers in `columns`
    are assumed to be aggregated counts.
    """
    for row in dicts:
        # emit one output row per unit of the largest count in this row
        for i in range(max(row[c] for c in columns)):
            d = dict(row)

            # replace each count by a 0/1 indicator
            d.update({c: int(i < row[c]) for c in columns})
            yield d

def unroll(df, *columns):
    # build a new frame with one row per deaggregated sample
    return pd.DataFrame(deaggregate(iterdicts(df), *columns))

Then you can do:

unroll(df, 'Number Applied', 'Hired')
   Feature(Math)  Hired  ID   Job Title  Number Applied
0              1      1   1  Accountant               1
1              1      1   1  Accountant               1
2              1      0   1  Accountant               1
3              0      1   2   Marketing               1
4              1      1   3     Finance               1
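
Alternatively, repeat each row Number Applied times with index.repeat, then mark the last Hired copies within each job title as hired:

d1 = df.loc[df.index.repeat(df['Number Applied'])]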

hired = (
    d1.groupby('Job Title').cumcount() >=
        d1['Number Applied'] - d1['Hired']
).astype(int)

d1.assign(**{'Number Applied': 1, 'Hired': hired})

   ID   Job Title  Number Applied  Hired  Feature(Math)
0   1  Accountant               1      0              1
0   1  Accountant               1      1              1
0   1  Accountant               1      1              1
1   2   Marketing               1      1              0
2   3     Finance               1      1              1
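
The output above still carries the original ID values and the Number Applied column, whereas the question asks for one ID per applicant and for that column to be dropped. A minimal sketch of those last steps follows, under two assumptions that are not in the answer above: the name result is illustrative, and the grouping is done on the original row index (level=0) rather than on Job Title, so that a job title appearing in several rows would not mix their counts.

```python
import pandas as pd

# Aggregated example frame from the question.
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Job Title': ['Accountant', 'Marketing', 'Finance'],
    'Number Applied': [3, 1, 1],
    'Hired': [2, 1, 1],
    'Feature(Math)': [1, 0, 1],
})

# One row per applicant: repeat each row 'Number Applied' times.
d1 = df.loc[df.index.repeat(df['Number Applied'])]

# Position of each copy within its original row (0, 1, 2, ...).
# Assumption: group by the original index instead of 'Job Title'.
pos = d1.groupby(level=0).cumcount()

# The last 'Hired' copies of each original row are flagged as hired.
hired = (pos >= d1['Number Applied'] - d1['Hired']).astype(int)

result = (
    d1.assign(Hired=hired)
      .drop(columns='Number Applied')   # no longer needed once unrolled
      .reset_index(drop=True)
)

# Give every applicant row its own ID, starting at 1.
result['ID'] = range(1, len(result) + 1)

print(result)
```

Summing Hired over the rows where Feature(Math) is 1 in result should now give 3, which is the count the question expects.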