How to calculate a value grouped by one attribute, but provided in the second column in pandas
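I have a DataFrame that contains the order ID, the client ID, Date_order (the DATE_START column below) and some metrics (not important here).
For every row, I want to get the ID of that client's last (previous) order.
I tried this: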
import pandas as pd
data = pd.DataFrame({
    'ID': [133853.0, 155755.0, 149331.0, 337270.0,
           775727.0, 200868.0, 138453.0, 738497.0,
           666802.0, 697070.0, 128148.0, 1042225.0,
           303441.0, 940515.0, 143548.0],
    'CLIENT': [235632.0, 231562.0, 235632.0, 231562.0, 734243.0,
               235632.0, 235632.0, 734243.0, 231562.0, 734243.0,
               235632.0, 734243.0, 231562.0, 734243.0, 235632.0],
    'DATE_START': ['2017-09-01 00:00:00', '2017-10-05 00:00:00',
                   '2017-09-26 00:00:00', '2018-03-23 00:00:00',
                   '2018-12-21 00:00:00', '2017-11-23 00:00:00',
                   '2017-09-08 00:00:00', '2018-12-12 00:00:00',
                   '2018-11-21 00:00:00', '2018-12-01 00:00:00',
                   '2017-08-22 00:00:00', '2019-02-06 00:00:00',
                   '2018-02-20 00:00:00', '2019-01-20 00:00:00',
                   '2017-09-17 00:00:00']
})
data.groupby('CLIENT').apply(lambda x:max(x['ID']))
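This considers all IDs and returns only three rows (one per client, with the max ID), but I need a value for every row of the DataFrame, looking only at that client's earlier rows. Please help.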
import pandas as pd
data=pd.DataFrame({
'ID': [133853.0,155755.0,149331.0,337270.0,
775727.0,200868.0,138453.0,738497.0,
666802.0,697070.0,128148.0,1042225.0,
303441.0,940515.0,143548.0],
'CLIENT':[235632.0,231562.0,235632.0,231562.0,734243.0,
235632.0,235632.0,734243.0,231562.0,734243.0,
235632.0,734243.0,231562.0,734243.0,235632.0],
'DATE_START': [('2017-09-01 00:00:00'), ('2017-10-05 00:00:00'),
('2017-09-26 00:00:00'), ('2018-03-23 00:00:00'),
('2018-12-21 00:00:00'), ('2017-11-23 00:00:00'),
('2017-09-08 00:00:00'), ('2018-12-12 00:00:00'),
('2018-11-21 00:00:00'), ('2018-12-01 00:00:00'),
('2017-08-22 00:00:00'), ('2019-02-06 00:00:00'),
('2018-02-20 00:00:00'), ('2019-01-20 00:00:00'),
('2017-09-17 00:00:00')]
})
data.groupby('CLIENT').apply(lambda df:
df[df['DATE_START'] == df['DATE_START'].max()].iloc[0][['ID', 'DATE_START']]
)
Output:
CLIENT ID DATE_START
231562.0 666802.0 2018-11-21 00:00:00
235632.0 200868.0 2017-11-23 00:00:00
734243.0 1042225.0 2019-02-06 00:00:00
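Let's break it down:
1.) Group by CLIENT. This gives an iterable of DataFrames, one per CLIENT.
2.) Apply a function to each DataFrame in the groupby (that is what the apply(lambda df: ...) part does).
3.) For each DataFrame, find the latest DATE_START, then subset the DataFrame to the rows that have that latest DATE_START (that is what df[df['DATE_START'] == df['DATE_START'].max()] is for). The dates are strings here, but in this year-first format string comparison matches chronological order.
4.) At this point I don't know what logic you want if a client has several orders on the same date. In this example I take the first match (.iloc[0]).
5.) Then I return ID and DATE_START.
6.) pandas understands that the results of applying the function to each DataFrame should be combined row-wise, which is why the output looks the way it does.
Let me know if this is what you were looking for.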
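As a side note, the same per-client result can probably be obtained without apply. The following is only a sketch on the question's data, not the method used above: it assumes that converting DATE_START with pd.to_datetime is acceptable, and the names orders and latest are made up for illustration. On ties it keeps the last row in the original order for that date, whereas .iloc[0] above keeps the first.

```python
import pandas as pd

# The question's data (dates shortened to YYYY-MM-DD, same days).
orders = pd.DataFrame({
    "ID": [133853.0, 155755.0, 149331.0, 337270.0, 775727.0,
           200868.0, 138453.0, 738497.0, 666802.0, 697070.0,
           128148.0, 1042225.0, 303441.0, 940515.0, 143548.0],
    "CLIENT": [235632.0, 231562.0, 235632.0, 231562.0, 734243.0,
               235632.0, 235632.0, 734243.0, 231562.0, 734243.0,
               235632.0, 734243.0, 231562.0, 734243.0, 235632.0],
    "DATE_START": ["2017-09-01", "2017-10-05", "2017-09-26", "2018-03-23",
                   "2018-12-21", "2017-11-23", "2017-09-08", "2018-12-12",
                   "2018-11-21", "2018-12-01", "2017-08-22", "2019-02-06",
                   "2018-02-20", "2019-01-20", "2017-09-17"],
})

# Assumption: real datetimes instead of strings.
orders["DATE_START"] = pd.to_datetime(orders["DATE_START"])

# Sort by date with a stable sort, then keep the last (latest) row per client.
latest = (
    orders.sort_values("DATE_START", kind="mergesort")
          .drop_duplicates("CLIENT", keep="last")
          .set_index("CLIENT")[["ID", "DATE_START"]]
          .sort_index()
)
print(latest)
```

On this data the IDs come out as 666802.0, 200868.0 and 1042225.0, the same as in the output above.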
data['id_last_order']= data.sort_values('DATE_START').groupby('CLIENT')['ID'].transform(lambda x: x.shift())
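To make the one-liner above easier to check, here is a small sketch on the first six rows of the question's data. It assumes DATE_START is converted with pd.to_datetime (the original keeps strings) and calls GroupBy.shift directly instead of transform with a lambda; the variable name ordered is mine.

```python
import pandas as pd

# First six rows of the question's data.
data = pd.DataFrame({
    "ID": [133853.0, 155755.0, 149331.0, 337270.0, 775727.0, 200868.0],
    "CLIENT": [235632.0, 231562.0, 235632.0, 231562.0, 734243.0, 235632.0],
    "DATE_START": ["2017-09-01", "2017-10-05", "2017-09-26",
                   "2018-03-23", "2018-12-21", "2017-11-23"],
})
data["DATE_START"] = pd.to_datetime(data["DATE_START"])

# Sort chronologically, then shift IDs down by one position within each client.
ordered = data.sort_values("DATE_START")
data["id_last_order"] = ordered.groupby("CLIENT")["ID"].shift()

# The assignment aligns on the index, so the original row order is kept.
# id_last_order: NaN, NaN, 133853.0, 155755.0, NaN, 149331.0
print(data)
```

A client's first order gets NaN because there is nothing before it. Note that if a client has two orders with the same DATE_START, shift will still treat one of them as the "previous" order of the other.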
Or with a custom function:
def select_last_order_id(row):
    # Earlier orders of the same client (strictly before this row's date)
    earlier = data[(data['CLIENT'] == row['CLIENT']) & (data['DATE_START'] < row['DATE_START'])]
    if earlier.empty:
        # First order of this client: there is no previous order
        return None
    # ID of the most recent earlier order
    return earlier.sort_values('DATE_START')['ID'].iloc[-1]

data['id_last_order'] = data.apply(select_last_order_id, axis=1)