Comparing two columns within the same DataFrame and computing statistics from the comparison
I have a DataFrame with a datetime index and three columns: id, revenue, and cost.
import numpy as np
import pandas as pd
from datetime import datetime

d = {'id': ['4573', '4573', '4573', '958245', '958245', '958245'],
     'revenue': np.random.uniform(size=6),
     'cost': np.random.uniform(size=6)}
e = ['2014-03-01', '2014-04-01', '2014-05-01', '2014-05-01', '2015-03-01', '2015-02-01']
dateindex = [datetime.strptime(a, '%Y-%m-%d') for a in e]
df = pd.DataFrame(d)
df.index = dateindex
                cost      id   revenue
2014-03-01  0.445597    4573  0.901713
2014-04-01  0.774029    4573  0.908302
2014-05-01  0.104274    4573  0.278444
2014-05-01  0.938426  958245  0.755022
2015-03-01  0.647886  958245  0.125072
2015-02-01  0.267773  958245  0.557496
I want to run various comparisons between revenue and cost for each id.
For example, in pseudocode:
If Revenue > Cost > 0
    CountA = CountA + 1
Elif 0 < Revenue < Cost
    CountB = CountB + 1
Elif Revenue > 0 > Cost
    CountC = CountC + 1
Elif Revenue = 0 and Cost > 0
    CountD = CountD + 1
For case A, I thought I could do:
df[['revenue']][df['id'] == '4573'] > df[['cost']][df['id'] == '4573']
But I got:
ValueError: Can only compare identically-labeled DataFrame objects
Is there a more efficient way to do what I want?
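A note on the error above: it most likely comes from the double brackets. `df[['revenue']]` and `df[['cost']]` are one-column DataFrames whose column labels differ, and pandas refuses to compare DataFrames that are not identically labeled. Selecting with single brackets gives Series, which compare row by row. The sketch below assumes a small hand-written frame in place of the random data; `subset`, `case_a`, and `count_a` are illustrative names only.

```python
import pandas as pd

# Small fixed frame standing in for the question's random data (values are made up).
df = pd.DataFrame(
    {
        "id": ["4573", "4573", "958245"],
        "revenue": [0.9, 0.2, 0.7],
        "cost": [0.4, 0.8, 0.9],
    },
    index=pd.to_datetime(["2014-03-01", "2014-04-01", "2014-05-01"]),
)

# Filter the rows once, then pick columns with single brackets to get Series.
subset = df[df["id"] == "4573"]

# Case A from the pseudocode: Revenue > Cost > 0, evaluated element-wise.
case_a = (subset["revenue"] > subset["cost"]) & (subset["cost"] > 0)

# Summing a boolean Series counts the rows where the condition holds.
count_a = int(case_a.sum())
print(count_a)
```

Filtering the frame once and then selecting both columns from the same subset keeps the two Series on the same index, so the comparison needs no alignment.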
First write the function you want, then wrap it so it can be applied to a DataFrame, then group by 'id' and apply it:
import collections
import datetime

import numpy as np
import pandas as pd

d = {'id': ['4573', '4573', '4573', '958245', '958245', '958245'],
     'revenue': np.random.uniform(size=6),
     'cost': np.random.uniform(size=6)}
e = ['2014-03-01', '2014-04-01', '2014-05-01', '2014-05-01', '2015-03-01', '2015-02-01']
dateindex = [datetime.datetime.strptime(a, '%Y-%m-%d') for a in e]
df = pd.DataFrame(d)
df.index = dateindex

# Classify a single (cost, revenue) pair into one of the cases
def Func(Cost, Revenue):
    if Revenue > Cost > 0:
        return 'A'
    elif Cost > Revenue > 0:
        return 'B'
    elif Revenue > 0 > Cost:
        return 'C'
    elif Revenue == 0 and Cost > 0:
        return 'D'
    # Falls through and returns None when no case matches

# Count the cases across all rows of one group
def Func_df(df):
    cases_list = [Func(x, y) for x, y in zip(df.cost.values, df.revenue.values)]
    return collections.Counter(cases_list)

df.groupby('id').apply(Func_df)
Output (hopefully):
id
4573 {u'A': 1, u'B': 2}
958245 {u'A': 1, u'B': 2}
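For larger frames, the same counts can probably be obtained without a Python-level loop. The following is a minimal sketch of one alternative, not the approach above: it labels each row with `np.select` and tabulates the labels per id with `pd.crosstab`. The data is hand-written (including zero and negative values so that cases C and D appear), and the `case` column and the `"none"` fallback label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hand-written data (not from the question) chosen so that every case occurs.
df = pd.DataFrame(
    {
        "id": ["4573", "4573", "4573", "958245", "958245", "958245"],
        "revenue": [0.9, 0.9, 0.0, 0.7, 0.1, 0.5],
        "cost": [0.4, 0.7, 0.1, 0.9, -0.6, 0.2],
    }
)

rev, cost = df["revenue"], df["cost"]

# One boolean Series per case, in the same order as the pseudocode.
conditions = [
    (rev > cost) & (cost > 0),  # A: Revenue > Cost > 0
    (cost > rev) & (rev > 0),   # B: 0 < Revenue < Cost
    (rev > 0) & (cost < 0),     # C: Revenue > 0 > Cost
    (rev == 0) & (cost > 0),    # D: Revenue == 0 and Cost > 0
]

# np.select takes the first matching condition per row; rows matching none get "none".
df["case"] = np.select(conditions, ["A", "B", "C", "D"], default="none")

# Rows are ids, columns are case labels, cells are counts.
counts = pd.crosstab(df["id"], df["case"])
print(counts)
```

The result is a regular DataFrame rather than a column of Counter objects, so individual counts can be read with `counts.loc[id, case]`.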