How to compute date differences between elements with the same ID in a list

I have the following data structure:

[ (19L, datetime.datetime(2015, 2, 11, 12, 3, 43)),
  (19L, datetime.datetime(2015, 2, 12, 16, 28, 48)),
  (19L, datetime.datetime(2014, 9, 17, 11, 58, 19)),
  (80L, datetime.datetime(2014, 9, 15, 12, 54, 36)),
  (80L, datetime.datetime(2014, 9, 15, 14, 16, 39)),
  (80L, datetime.datetime(2014, 2, 6, 8, 58, 39)),
  (80L, datetime.datetime(2014, 9, 8, 14, 21, 48)),
  (90L, datetime.datetime(2016, 8, 2, 18, 14, 31)),
  (90L, datetime.datetime(2016, 8, 2, 21, 14, 23)),
  (90L, datetime.datetime(2014, 1, 5, 16, 35, 34))  ]

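I need to calculate the average number of days between the dates that share the same user ID. The first element of each tuple is the user ID and the second is the datetime.

I am having trouble iterating over the list, handling each user separately, and getting the differences between that user's dates...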

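You can use itertools.groupby() to group by user ID (assuming the list is sorted by the grouping key, which appears to be the case). Then you can iterate over each user's dates "pairwise" and compute an average day difference. Note that zip(dates[0::2], dates[1::2]) pairs elements (0, 1), (2, 3), and so on, so a trailing unpaired date is ignored: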

In [1]: import datetime
In [2]: from operator import itemgetter
In [3]: from itertools import groupby

In [4]: l = [ 
   ...:   (19L, datetime.datetime(2015, 2, 11, 12, 3, 43)),
   ...:   (19L, datetime.datetime(2015, 2, 12, 16, 28, 48)),
   ...:   (19L, datetime.datetime(2014, 9, 17, 11, 58, 19)),
   ...:   (80L, datetime.datetime(2014, 9, 15, 12, 54, 36)),
   ...:   (80L, datetime.datetime(2014, 9, 15, 14, 16, 39)),
   ...:   (80L, datetime.datetime(2014, 2, 6, 8, 58, 39)),
   ...:   (80L, datetime.datetime(2014, 9, 8, 14, 21, 48)),
   ...:   (90L, datetime.datetime(2016, 8, 2, 18, 14, 31)),
   ...:   (90L, datetime.datetime(2016, 8, 2, 21, 14, 23)),
   ...:   (90L, datetime.datetime(2014, 1, 5, 16, 35, 34))  ]

In [5]: for user_id, dates in groupby(l, itemgetter(0)):
    ...:     dates = [date[1] for date in dates]
    ...:     differences = [abs((d1 - d2).days) for d1, d2 in zip(dates[0::2], dates[1::2])]
    ...:     print(user_id, sum(differences) / len(differences))
    ...:     
(19L, 2)
(80L, 108)
(90L, 1)
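
For comparison, here is a minimal Python 3 sketch that sorts each user's dates first and averages the gaps between adjacent dates, instead of non-overlapping pairs in list order. The helper name average_gap_days and the trimmed sample list are illustrative assumptions, not part of the answer above, and plain ints replace Python 2 long literals such as 19L.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def average_gap_days(records):
    """Map each user ID to the mean gap, in days, between adjacent dates.

    Hypothetical helper: `records` is a list of (user_id, datetime) tuples.
    Users with fewer than two dates are skipped.
    """
    result = {}
    # groupby() only merges adjacent items, so sort by ID (then date) first.
    for user_id, group in groupby(sorted(records), key=itemgetter(0)):
        dates = [date for _, date in group]
        gaps = [later - earlier for earlier, later in zip(dates, dates[1:])]
        if gaps:
            total_days = sum(gap.total_seconds() for gap in gaps) / 86400
            result[user_id] = total_days / len(gaps)
    return result


sample = [
    (19, datetime(2015, 2, 11, 12, 3, 43)),
    (19, datetime(2015, 2, 12, 16, 28, 48)),
    (19, datetime(2014, 9, 17, 11, 58, 19)),
    (90, datetime(2016, 8, 2, 18, 14, 31)),
    (90, datetime(2016, 8, 2, 21, 14, 23)),
    (90, datetime(2014, 1, 5, 16, 35, 34)),
]

for user_id, days in average_gap_days(sample).items():
    print(user_id, round(days, 2))
```

Using total_seconds() keeps the fractional part of each gap, whereas .days truncates every difference to whole days before averaging.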

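I would sort the timestamps into a dictionary in which each key is a user ID and each value is a list of visit times. Then, after sorting each list of timestamps, find the difference between consecutive visit times and take the average. datetime.timedelta objects simplify the arithmetic on timestamps.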

from collections import defaultdict
from datetime import timedelta

# l = [(id, datetime), (...), ...]

d = defaultdict(list)
for ID, time in l:
    d[ID].append(time)  # build the list of visit times for each user

for ID, times in d.items():
    times.sort()  # sort once per user, after all times are collected
    # differences between consecutive visits
    timeDeltas = [later - earlier for earlier, later in zip(times, times[1:])]
    if not timeDeltas:
        continue  # a single visit has no interval to average
    averageVisitFrequency = sum(timeDeltas, timedelta()) // len(timeDeltas)  # average timedelta
    print('user {} makes a purchase every {} days on average'.format(ID, averageVisitFrequency.days))  # example output usage
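
Because the consecutive differences of a sorted list telescope, their sum is just the latest timestamp minus the earliest, so the same average can be obtained without building the list of deltas. The following is a minimal Python 3 sketch of that shortcut. The helper name average_interval is hypothetical, and it uses true division (/) on timedelta, which is a Python 3 feature.

```python
from datetime import datetime, timedelta


def average_interval(times):
    """Average gap between consecutive timestamps (hypothetical helper).

    The gaps of a sorted list sum to max - min, so no intermediate
    list of timedeltas is needed. Returns None for fewer than two times.
    """
    if len(times) < 2:
        return None
    return (max(times) - min(times)) / (len(times) - 1)


visits = [
    datetime(2014, 9, 15, 12, 54, 36),
    datetime(2014, 9, 15, 14, 16, 39),
    datetime(2014, 2, 6, 8, 58, 39),
    datetime(2014, 9, 8, 14, 21, 48),
]

# Cross-check against the explicit list of consecutive differences.
ordered = sorted(visits)
deltas = [b - a for a, b in zip(ordered, ordered[1:])]
assert average_interval(visits) == sum(deltas, timedelta()) / len(deltas)
print(average_interval(visits).days)
```

The shortcut also avoids sorting altogether, since only the minimum and maximum are needed.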