Comparing two columns within the same DataFrame and computing statistics from the comparison

I have a DataFrame with a datetime index and three columns: id, revenue, and cost.

import numpy as np
import pandas as pd
from datetime import datetime

d = {'id': ['4573', '4573', '4573', '958245', '958245', '958245'],
     'revenue': np.random.uniform(size=6),
     'cost': np.random.uniform(size=6)}
e = ['2014-03-01','2014-04-01','2014-05-01','2014-05-01','2015-03-01','2015-02-01']

dateindex = [datetime.strptime(a, '%Y-%m-%d') for a in e]

df = pd.DataFrame(d)
df.index = dateindex

                cost      id   revenue
2014-03-01  0.445597    4573  0.901713
2014-04-01  0.774029    4573  0.908302
2014-05-01  0.104274    4573  0.278444
2014-05-01  0.938426  958245  0.755022
2015-03-01  0.647886  958245  0.125072
2015-02-01  0.267773  958245  0.557496

I want to run various comparisons between revenue and cost for each id.

For example:

Pseudocode:

If Revenue > Cost > 0
 CountA = CountA + 1
Elif 0 < Revenue < Cost 
 CountB = CountB + 1
Elif Revenue > 0 > Cost
 CountC = CountC + 1
Elif Revenue = 0 and Cost > 0
 CountD = CountD + 1

For case A, I thought I could do:

df[['revenue']][df['id'] == '4573'] > df[['cost']][df['id'] == '4573']

But I get:

ValueError: Can only compare identically-labeled DataFrame objects

Is there a more efficient way to do what I want?
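A note on the ValueError above: double brackets such as df[['revenue']] return one-column DataFrames, and a DataFrame labelled 'revenue' cannot be compared with one labelled 'cost'. Selecting each column with single brackets gives Series, which compare row by row. The following is a minimal sketch of that idea for case A only; the small fixed frame and the names subset, case_a, and count_a are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Small fixed frame standing in for the question's random data (values are made up).
df = pd.DataFrame(
    {
        "id": ["4573", "4573", "958245"],
        "revenue": [0.9, 0.2, 0.7],
        "cost": [0.4, 0.8, 0.1],
    },
    index=pd.to_datetime(["2014-03-01", "2014-04-01", "2014-05-01"]),
)

subset = df[df["id"] == "4573"]

# Double brackets give one-column DataFrames labelled 'revenue' and 'cost';
# the column labels differ, so pandas refuses to compare them.
try:
    subset[["revenue"]] > subset[["cost"]]
    raised = False
except ValueError:
    raised = True

# Single brackets give Series, which are compared row by row on the shared index.
case_a = (subset["revenue"] > subset["cost"]) & (subset["cost"] > 0)
count_a = int(case_a.sum())
```

Note that chained comparisons like Revenue > Cost > 0 do not work on Series, so the two conditions are combined with & instead.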

First write the function you want, then wrap it so that it can be applied to a DataFrame, then group by 'id' and apply it:

import pandas as pd
import numpy as np
import datetime
import collections

d = {'id' : ['4573', '4573', '4573', '958245','958245','958245'] \
 ,'revenue' : np.random.uniform(size=6),'cost' : np.random.uniform(size=6)}
e = ['2014-03-01','2014-04-01','2014-05-01','2014-05-01','2015-03-01','2015-02-01']

dateindex = [datetime.datetime.strptime(a, '%Y-%m-%d') for a in e]

df = pd.DataFrame(d)
df.index = dateindex

# Classify a single (cost, revenue) pair into one of the cases from the question
def Func(Cost,Revenue):
    if Revenue > Cost > 0:
        return 'A'
    elif Cost > Revenue > 0:
        return 'B'
    elif Revenue > 0 > Cost:
        return 'C'
    elif Revenue == 0 and Cost > 0:
        return 'D'
    # any other combination falls through and returns None

# Count the cases within one group; argument order matches Func(Cost, Revenue)
def Func_df(df):
    cases_list = [Func(x, y) for x, y in zip(df.cost.values, df.revenue.values)]
    return collections.Counter(cases_list)

df.groupby('id').apply(Func_df)

Output (hopefully; the exact counts vary because the data is random):

id
4573      {u'A': 1, u'B': 2}
958245    {u'A': 1, u'B': 2}
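
As a possible alternative to the row-by-row function above, the same classification can be sketched with vectorized boolean masks. This is only a sketch under stated assumptions: it swaps in np.select and pd.crosstab rather than the Counter-based approach, the fixed sample values are made up so that each case appears, the datetime index is dropped for brevity, and the 'case' column and 'other' label are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so that every case (and one unmatched row) occurs.
df = pd.DataFrame(
    {
        "id": ["4573", "4573", "4573", "958245", "958245", "958245"],
        "revenue": [0.9, 0.3, 0.0, 0.7, 0.5, 0.2],
        "cost": [0.4, 0.8, 0.6, -0.1, 0.9, 0.2],
    }
)

rev, cost = df["revenue"], df["cost"]

# One boolean mask per case, in the same order as the pseudocode.
conditions = [
    (rev > cost) & (cost > 0),  # A: Revenue > Cost > 0
    (cost > rev) & (rev > 0),   # B: 0 < Revenue < Cost
    (rev > 0) & (cost < 0),     # C: Revenue > 0 > Cost
    (rev == 0) & (cost > 0),    # D: Revenue = 0 and Cost > 0
]
labels = ["A", "B", "C", "D"]

# Rows matching none of the cases get the hypothetical label 'other'.
df["case"] = np.select(conditions, labels, default="other")

# Rows are ids, columns are case labels, cells are counts.
counts = pd.crosstab(df["id"], df["case"])
print(counts)
```

The result is a table with one row per id and one column per case, which may be easier to work with downstream than a Counter per group.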