Efficient Python grouping of object list into date ranges based on property
I wrote some code that takes an input list of "order" objects that looks like this...
**Company, theDate, Product, ProductReceived**
Apple, 2020-10-01, Subscription, 0
Apple, 2020-10-01, Trial, 0
Apple, 2020-11-01, Subscription, 0
Apple, 2020-11-01, Trial, 1
Apple, 2020-12-01, Subscription, 1
Apple, 2020-12-01, Trial, 0
Apple, 2021-01-01, Subscription, 1
Apple, 2021-01-01, Trial, 1
Apple, 2021-02-01, Subscription, 1
Apple, 2021-02-01, Trial, 1
Apple, 2021-03-01, Subscription, 0
Apple, 2021-03-01, Trial, 1
and converts it into a simple string output like this...
[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03
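The idea is to group all the dates into contiguous ranges based on whether all products for a given date have been received.
The code works, but it feels very clunky, and I suspect python/pandas may offer a near out-of-the-box solution for a task like this. Any thoughts on the most elegant solution would be much appreciated, as this code (simplified from my real example) has already become hard to extend. Here is a complete working example...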
class orderObject:
    def __init__(self,company, theDate, product, productReceived):
        self.company = company
        self.theDate = theDate
        self.product = product
        self.productReceived = productReceived

orderList = [orderObject('Apple','2020-10-01','Subscription',0)
    , orderObject('Apple','2020-10-01','Trial',0)
    , orderObject('Apple','2020-11-01','Subscription',0)
    , orderObject('Apple','2020-11-01','Trial',1)
    , orderObject('Apple','2020-12-01','Subscription',1)
    , orderObject('Apple','2020-12-01','Trial',0)
    , orderObject('Apple','2021-01-01','Subscription',1)
    , orderObject('Apple','2021-01-01','Trial',1)
    , orderObject('Apple','2021-02-01','Subscription',1)
    , orderObject('Apple','2021-02-01','Trial',1)
    , orderObject('Apple','2021-03-01','Subscription',0)
    , orderObject('Apple','2021-03-01','Trial',1)
    ]

dateSet = {}
for o in orderList:
    dateSet[o.theDate] = 1 #loop through once to assume every date is complete
for o in orderList: #loop through a second time, and take any incomplete date/product as a fail (i.e. all products must be complete for the date to be complete)
    if o.productReceived < 1:
        dateSet[o.theDate] = 0
#now dateSet contains references to every date and whether we have received products for ALL products at that date, e.g. attr: 2020-10-01, value: 0 etc

def updateCompanyOrderGrouped(lcGroup):
    indexText = startGroup[:-3].replace("-",".")
    if (startGroup!=endGroup):
        indexText = indexText + " - " + endGroup[:-3].replace("-",".")
    if (lastValue==0):
        lcGroup += "\n[ ] " + indexText
    else:
        lcGroup += "\n[x] " + indexText
    return lcGroup

companyOrderGrouped = "" #string to store grouping result
lastValue = -1 #start with a "last value" that won't match any current value
startGroup = endGroup = "NA" #start with startGroup and endGroup that won't match any current groups
for attr, value in dateSet.items():
    if value != lastValue:
        if startGroup != "NA":
            #we just ended a grouping because the value changed...
            companyOrderGrouped = updateCompanyOrderGrouped(companyOrderGrouped)
        startGroup = attr
    #whatever happens, update the endGroup + value to keep track of the last record
    endGroup = attr
    lastValue = value
print(updateCompanyOrderGrouped(companyOrderGrouped).lstrip('\n')) #finish with last call to updateCompanyOrderGrouped to end the final group
You may be looking for something like this:
from pandas import DataFrame
order_columns = ['company', 'date', 'product', 'received']
order_data = [
['Apple', '2020-10-01', 'Subscription', 0],
['Apple', '2020-10-01', 'Trial', 0],
['Apple', '2020-11-01', 'Subscription', 0],
['Apple', '2020-11-01', 'Trial', 1],
['Apple', '2020-12-01', 'Subscription', 1],
['Apple', '2020-12-01', 'Trial', 0],
['Apple', '2021-01-01', 'Subscription', 1],
['Apple', '2021-01-01', 'Trial', 1],
['Apple', '2021-02-01', 'Subscription', 1],
['Apple', '2021-02-01', 'Trial', 1],
['Apple', '2021-03-01', 'Subscription', 0],
['Apple', '2021-03-01', 'Trial', 1]
]
df = DataFrame(order_data, columns=order_columns)
# create an extra column all_received that is 0 or 1 for all products received on a date
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
# grouping that uses a temporary series changing value every time all_received changes
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
# from that grouping, the value and date of every first + date of every last
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]
# print similar to your format:
for value, start, end in result:
    print(f'[{"x" if value else " "}] {start}{" - " + end if end != start else ""}')
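So, apart from the definition of the data, you were right: it is three lines of code, but I think your solution is more readable. Reading these dense lines of pandas code and working out what is going on takes some time.
Note that I renamed some variables to all lowercase with underscores, which is the recommended naming convention for Python.
Output: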
[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03
Since you commented that this is a bit beyond your understanding of pandas, here is some background:
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
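.groupby creates a grouping of the records in df by 'date', i.e. it puts all records with the same date together. It then selects the 'received' column from the grouping and applies .transform('all') to it, which creates a series that is True if all values in the group are truthy (like 1) or False otherwise (i.e. if one or more is 0). Finally, .astype(int) converts those booleans back to integers (0 or 1). The resulting series is assigned to a new all_received column, which still has the same number of records.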
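To see what .transform('all') does in isolation, here is a minimal sketch on a cut-down version of the data. The names demo and flags are my own, chosen only for illustration.

```python
import pandas as pd

# Cut-down sample: two dates with two products each (illustrative data only).
demo = pd.DataFrame({
    'date': ['2020-11-01', '2020-11-01', '2021-01-01', '2021-01-01'],
    'received': [0, 1, 1, 1],
})

# transform('all') returns one value per original row, not one per group:
# True when every 'received' value in that row's date group is truthy.
flags = demo.groupby('date')['received'].transform('all')

# Convert the booleans to 0/1, as in the line explained above.
demo['all_received'] = flags.astype(int)
print(demo)
```

Both rows for 2020-11-01 end up with 0, because one of the two products is missing, while both rows for 2021-01-01 end up with 1.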
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
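The inner bit df.all_received != df.all_received.shift() compares the column with a copy of itself shifted by 1 position, so it is True (1) if the value of all_received on the current record differs from the value on the previous record, or False (0) otherwise. The resulting series is then summed cumulatively (i.e. [0, 1, 1, 0, 1] would become [0, 1, 2, 2, 3]). This means that the resulting series labels the consecutive runs of 1 and 0 in all_received without changing their order (as applying .groupby to all_received directly would). A grouping of the original df is then created from that temporary series, which is the grouping you want.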
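The intermediate series are easier to follow when printed. The sketch below rebuilds only the all_received values from the example data (one per row, in date order). The names changed and run_id are assumptions made for this illustration.

```python
import pandas as pd

# all_received for the 12 example rows, two rows per month.
all_received = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# True wherever the value differs from the previous row.
# The first row is compared against NaN, so it is always True.
changed = all_received != all_received.shift()

# Running total of the change markers: every run gets its own label.
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
```

Because the first comparison is always True, the labels start at 1 rather than 0. Only the fact that each run has a distinct label matters for the grouping.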
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]
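In this last bit, the grouping is used twice (which is why it was created separately). grouping.nth(0).itertuples() gives you the first record of every group, as a tuple of values. Similarly, grouping.nth(-1).itertuples() gives you the last record of every group. By zipping these iterables together, you get pairs of the first and last record of each group, which is exactly what is needed to create the output. The rest is just a normal list comprehension, taking the formatted dates of first and last, as well as all_received of last (first would work too, since they are in the same group and so have the same value).
Of course, the print statement at the end produces the output in the format you need, although the date formatting was already done in the previous step.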
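As a possible variation on that last step (not what the code above does), the first and last record of each run can also be collected in a single pass with named aggregation instead of zipping two nth() calls. This is only a sketch: the column names value, start and end, and the one-row-per-date sample frame, are assumptions made for the illustration.

```python
import pandas as pd

# One row per date with its all_received flag (simplified from the example).
df = pd.DataFrame({
    'date': ['2020-10-01', '2020-11-01', '2020-12-01',
             '2021-01-01', '2021-02-01', '2021-03-01'],
    'all_received': [0, 0, 0, 1, 1, 0],
})

# Label each consecutive run of equal all_received values.
run_id = (df['all_received'] != df['all_received'].shift()).cumsum().rename('run_id')

# One output row per run: its flag plus the first and last date in the run.
ranges = df.groupby(run_id).agg(
    value=('all_received', 'first'),
    start=('date', 'first'),
    end=('date', 'last'),
)

lines = []
for row in ranges.itertuples():
    start = row.start[:-3].replace('-', '.')
    end = row.end[:-3].replace('-', '.')
    mark = 'x' if row.value else ' '
    lines.append(f'[{mark}] {start}' + (f' - {end}' if end != start else ''))
print('\n'.join(lines))
```

On this sample it prints the same three lines as the output shown earlier.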
Another option, using pandas and keeping some readability for future changes/extensions:
import pandas as pd

df = pd.read_csv('pandas-groupby-date-ranges.csv') # original data
df['theDate'] = pd.to_datetime(df['theDate'])
df['year'] = df['theDate'].dt.year
df['month'] = df['theDate'].dt.month
aggregate_df = pd.DataFrame()
for name, group in df.groupby(['year', 'month']):
    # assign returns a copy, so the group itself is not modified in place
    group = group.assign(all_received=group['ProductReceived'].all())
    aggregate_df = pd.concat([aggregate_df, group])
# label each consecutive run of equal all_received values
aggregate_df['group'] = (
    aggregate_df['all_received'].ne(aggregate_df['all_received'].shift(1)).cumsum()
)
for name, group in aggregate_df.groupby('group'):
    group_min = group['theDate'].min()
    group_max = group['theDate'].max()
    # output to desired format
    x = '[x] ' if group['all_received'].iloc[0] else '[ ] '
    if group_min != group_max:
        print(
            x
            + str(group_min.year)
            + "."
            + str(group_min.month)
            + ' - '
            + str(group_max.year)
            + "."
            + str(group_max.month)
        )
    else:
        print(x + str(group_min.year) + "." + str(group_min.month))
Output:
[ ] 2020.10 - 2020.12
[x] 2021.1 - 2021.2
[ ] 2021.3
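The months in this output are not zero-padded (2021.1 rather than the 2021.01 asked for in the question), because str() of an integer month does no padding. One way to close that gap, sketched here with the standard library only, is strftime with the %m directive. The helper name format_range is hypothetical. pandas Timestamp objects also provide strftime, so the same idea should carry over to group_min and group_max.

```python
from datetime import date


def format_range(start, end, all_received):
    """Hypothetical helper: render one run as '[x] YYYY.MM - YYYY.MM'."""
    mark = '[x] ' if all_received else '[ ] '
    first = start.strftime('%Y.%m')  # %m zero-pads the month
    last = end.strftime('%Y.%m')
    # Collapse the range when both ends fall in the same month.
    return mark + (first if first == last else f'{first} - {last}')


print(format_range(date(2021, 1, 1), date(2021, 2, 1), True))   # [x] 2021.01 - 2021.02
print(format_range(date(2021, 3, 1), date(2021, 3, 1), False))  # [ ] 2021.03
```

Note that this sketch compares the formatted months rather than the raw dates, so two different days in the same month are shown as a single month.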