Efficient Python grouping of object list into date ranges based on property

I wrote some code that takes an input list of "order" objects that looks like this...

**Company, theDate, Product, ProductReceived**
Apple, 2020-10-01, Subscription, 0
Apple, 2020-10-01, Trial, 0
Apple, 2020-11-01, Subscription, 0
Apple, 2020-11-01, Trial, 1
Apple, 2020-12-01, Subscription, 1
Apple, 2020-12-01, Trial, 0
Apple, 2021-01-01, Subscription, 1
Apple, 2021-01-01, Trial, 1
Apple, 2021-02-01, Subscription, 1
Apple, 2021-02-01, Trial, 1
Apple, 2021-03-01, Subscription, 0
Apple, 2021-03-01, Trial, 1

and converts it into a simple string output like this...

[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03

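The idea is to group all dates into contiguous ranges according to whether all products for a given date have been received.

The code works, but it feels very clunky, and I suspect python/pandas may offer a near out-of-the-box solution for a task like this. Any thoughts on the most elegant solution would be much appreciated, as this code (simplified from my real example) has already become hard to extend. Here is a full working example...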

class orderObject:
    def __init__(self,company, theDate, product, productReceived):
        self.company = company
        self.theDate = theDate  
        self.product = product  
        self.productReceived = productReceived

orderList = [orderObject('Apple','2020-10-01','Subscription',0)
              , orderObject('Apple','2020-10-01','Trial',0)
              , orderObject('Apple','2020-11-01','Subscription',0)
              , orderObject('Apple','2020-11-01','Trial',1)
              , orderObject('Apple','2020-12-01','Subscription',1)
              , orderObject('Apple','2020-12-01','Trial',0)
              , orderObject('Apple','2021-01-01','Subscription',1)
              , orderObject('Apple','2021-01-01','Trial',1)
              , orderObject('Apple','2021-02-01','Subscription',1)
              , orderObject('Apple','2021-02-01','Trial',1)
              , orderObject('Apple','2021-03-01','Subscription',0)
              , orderObject('Apple','2021-03-01','Trial',1)    
             ]

dateSet = {}
for o in orderList:
    dateSet[o.theDate] = 1 #loop through once to assume every date is complete
   
for o in orderList:  #loop through a second time, and take any incomplete date/product as a fail (i.e. all products must be complete for the date to be complete)
    if o.productReceived < 1:
        dateSet[o.theDate] = 0            
#now dateSet contains references to every date and whether we have received products for ALL products at that date, e.g. attr: 2020-10-01, value: 0 etc

def updateCompanyOrderGrouped(lcGroup):
    indexText = startGroup[:-3].replace("-",".")
    if (startGroup!=endGroup):
        indexText = indexText + " - " + endGroup[:-3].replace("-",".")
    if (lastValue==0):
        lcGroup += "\n[ ] " + indexText
    else:
        lcGroup += "\n[x] " + indexText
    return lcGroup

companyOrderGrouped = "" #string to store grouping result
lastValue = -1 #start with a "last value" that won't match any current value
startGroup = endGroup = "NA" #start with startGroup and endGroup that won't match any current groups

for attr, value in dateSet.items():
    if value != lastValue:
        if startGroup != "NA":
            #we just ended a grouping because the value changed...
            companyOrderGrouped = updateCompanyOrderGrouped(companyOrderGrouped)
        startGroup = attr
    #whatever happens, update the endGroup + value to keep track of the last record
    endGroup = attr
    lastValue = value

print(updateCompanyOrderGrouped(companyOrderGrouped).lstrip('\n')) #finish with last call to updateCompanyOrderGrouped to end the final group

You're probably looking for something like this:

from pandas import DataFrame

order_columns = ['company', 'date', 'product', 'received']
order_data = [
    ['Apple', '2020-10-01', 'Subscription', 0],
    ['Apple', '2020-10-01', 'Trial', 0],
    ['Apple', '2020-11-01', 'Subscription', 0],
    ['Apple', '2020-11-01', 'Trial', 1],
    ['Apple', '2020-12-01', 'Subscription', 1],
    ['Apple', '2020-12-01', 'Trial', 0],
    ['Apple', '2021-01-01', 'Subscription', 1],
    ['Apple', '2021-01-01', 'Trial', 1],
    ['Apple', '2021-02-01', 'Subscription', 1],
    ['Apple', '2021-02-01', 'Trial', 1],
    ['Apple', '2021-03-01', 'Subscription', 0],
    ['Apple', '2021-03-01', 'Trial', 1]
]
df = DataFrame(order_data, columns=order_columns)

# create an extra column all_received that is 0 or 1 for all products received on a date
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
# grouping that uses a temporary series changing value every time all_received changes
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
# from that grouping, the value and date of every first + date of every last
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]

# print similar to your format:
for value, start, end in result:
    print(f'[{"x" if value else " "}] {start}{" - " + end if end != start else ""}')

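So, apart from the definition of the data, you were right: it's three lines of code. However, I think your solution is more readable; it takes some time to read these dense lines of pandas code and understand what is going on.

Note that I renamed some variables to all lowercase with underscores, which is the naming convention recommended for Python (PEP 8).

Output: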

[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03

Since you commented that this is a bit beyond your understanding of pandas, here is some background:

df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)

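The .groupby creates a grouping of the records in df by 'date', i.e. it groups together all records that have the same date. It then selects the 'received' column from that grouping and applies .transform('all') to it, which creates a series that is True if all values in the group are truthy (like 1) and False otherwise (i.e. if one or more are 0). Finally, .astype(int) turns those booleans back into integers (0 or 1). The resulting series is assigned to a new all_received column, which still has the same number of records as df.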
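
To make that step concrete, here is a minimal sketch on a toy frame. The four rows are made up for illustration and are not the data from the question; only the groupby/transform/astype chain is taken from the answer.

```python
import pandas as pd

# Toy data (hypothetical): two dates, two products each
df = pd.DataFrame({
    'date': ['2020-11-01', '2020-11-01', '2021-01-01', '2021-01-01'],
    'received': [0, 1, 1, 1],
})

# One boolean per row: True only if every 'received' value on that date is truthy
all_flags = df.groupby('date')['received'].transform('all')

# Cast back to 0/1 and store it next to the original rows (row count is unchanged)
df['all_received'] = all_flags.astype(int)
print(df)
```

The point to notice is that transform returns one value per original row rather than one per group, which is why the result can be assigned straight back as a column.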

grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])

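The inner bit, df.all_received != df.all_received.shift(), yields a boolean series that is True (1) where the value of all_received on the current record differs from the value on the previous record (the comparison is against the same column shifted down by one position), and False (0) otherwise. The first record always counts as a change, because its shifted counterpart is missing. That series is then summed cumulatively (i.e. [0, 1, 1, 0, 1] would become [0, 1, 2, 2, 3]). The resulting series therefore labels each contiguous run of 1s or 0s in all_received without changing the order of the records (a .groupby applied to all_received itself would instead lump all the 1s into one group and all the 0s into another). A grouping of the original df is then created from that temporary series, and that is the grouping you want.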
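
The shift-and-cumsum trick can be tried in isolation. This is a small sketch on a hand-written series; the values and the names changed and group_id are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-row flags, already in date order
s = pd.Series([0, 0, 0, 1, 1, 0], name='all_received')

# True where the value differs from the previous row; the first row compares
# against a missing value, so it is always True
changed = s != s.shift()

# Running count of changes: every contiguous run of equal values gets its own label
group_id = changed.cumsum()

print(changed.tolist())   # [True, False, False, True, False, True]
print(group_id.tolist())  # [1, 1, 1, 2, 2, 3]
```

Grouping by group_id then keeps the final 0 apart from the leading run of 0s, which a plain groupby on the values would not.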

result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]

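In the last bit, the grouping is used twice (which is why it was created separately). grouping.nth(0).itertuples() gives you the first record of every group as a tuple of values. Similarly, grouping.nth(-1).itertuples() gives you the last record of every group. Zipping these iterables together gives you pairs of the first and last record of each group, which is exactly what is needed to create the output. The rest is just an ordinary list comprehension that takes the formatted dates of first and last, plus the all_received of last (first would work just as well; both are in the same group, so the value is the same).

Finally, of course, the print statement produces the output in the format you need, although the date formatting was already done in the previous step.

Another option, using pandas and keeping some readability for future changes/extensions: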
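
If zipping nth(0) and nth(-1) feels opaque, one possible alternative (not what the code above does) is named aggregation with 'first' and 'last'. The sketch below assumes one row per date with all_received already computed; the column name run and the labels flag, start and end are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: one row per date, flag already computed
df = pd.DataFrame({
    'date': ['2020-10-01', '2020-11-01', '2020-12-01',
             '2021-01-01', '2021-02-01', '2021-03-01'],
    'all_received': [0, 0, 0, 1, 1, 0],
})

# Label contiguous runs of equal all_received values
df['run'] = (df['all_received'] != df['all_received'].shift()).cumsum()

# One row per run: its flag, first date and last date
summary = df.groupby('run').agg(
    flag=('all_received', 'first'),
    start=('date', 'first'),
    end=('date', 'last'),
)

lines = []
for row in summary.itertuples():
    start = row.start[:-3].replace('-', '.')  # '2020-10-01' -> '2020.10'
    end = row.end[:-3].replace('-', '.')
    mark = 'x' if row.flag else ' '
    lines.append(f'[{mark}] {start}' + (f' - {end}' if end != start else ''))

print('\n'.join(lines))
```

This trades the two nth calls for a single summary table, which may be easier to extend with further per-range columns.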

import pandas as pd

df = pd.read_csv('pandas-groupby-date-ranges.csv')  # original data

# infer_datetime_format is deprecated as of pandas 2.0; ISO dates parse without it
df['theDate'] = pd.to_datetime(df['theDate'])
df['year'] = df['theDate'].dt.year
df['month'] = df['theDate'].dt.month

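groups = []
for name, group in df.groupby(['year', 'month']):
    # flag every row of the month with whether all its products were received;
    # assign() returns a new frame, so the group is not modified in place
    groups.append(group.assign(all_received=group['ProductReceived'].all()))
aggregate_df = pd.concat(groups)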

aggregate_df['group'] = (
    aggregate_df['all_received'].ne(aggregate_df['all_received'].shift(1)).cumsum()
)

for name, group in aggregate_df.groupby('group'):
    group_min = group['theDate'].min()
    group_max = group['theDate'].max()

    # output to desired format
    x = '[x] ' if group['all_received'].iloc[0] == True else '[ ] '
    if group_min != group_max:
        print(
            x
            + str(group_min.year)
            + "."
            + str(group_min.month)
            + ' - '
            + str(group_max.year)
            + "."
            + str(group_max.month)
        )
    else:
        print(x + str(group_min.year) + "." + str(group_min.month))

Output:

[ ] 2020.10 - 2020.12
[x] 2021.1 - 2021.2
[ ] 2021.3
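
The months above are not zero-padded, unlike the format in the question. If the padded form is wanted, strftime on the timestamps is one way to get it. A minimal sketch, assuming group_min and group_max are pandas Timestamps as in the loop above (the two dates here are placeholders):

```python
import pandas as pd

# Placeholder bounds standing in for group['theDate'].min() / .max()
group_min = pd.Timestamp('2021-01-01')
group_max = pd.Timestamp('2021-02-01')

# '%Y.%m' gives a four-digit year and a zero-padded month
label = group_min.strftime('%Y.%m')
if group_min != group_max:
    label += ' - ' + group_max.strftime('%Y.%m')

print(label)  # 2021.01 - 2021.02
```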