Hours, Date, Day Count Calculation

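I have a huge dataset containing dates and timestamps spanning several days. The date-time values are in UNIX format. The dataset is a log of logins.

The code should group the logs by start and end time, and report the log count and the unique ID count.

I am trying to get some statistics, such as: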

total log counts per hour & unique login ids per hour. 

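Log counts over a selectable number of hours, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., as well as per day of the week and similar options.

I can split the data by start and end hours, but I cannot get the statistics for log counts and unique IDs.

Code: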

from datetime import datetime, time

# Test whether each log falls inside the daily window from start to end
start = time(8, 0, 0)
end = time(20, 0, 0)

with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        # Fields are comma-separated: UserID, StartTime, StopTime, GPS1, GPS2
        col = row.strip().split(',')
        t1 = datetime.fromtimestamp(float(col[1])).time()  # StartTime
        t2 = datetime.fromtimestamp(float(col[2])).time()  # StopTime
        print(t1 >= start and t2 <= end)

Input data format: the data has no headers, but the fields are as follows. The number of days in the input is unknown.

UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525

Expected output (sample):

StartTime, EndTime, Day, LogCount, UniqueIDCount

00:00:00, 01:00:00, Mon, 349, 30  

StartTime and EndTime are in human-readable format.

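Splitting the data by time range is already done, but I am trying to write something that rounds the times and computes the counts of logs and unique IDs. Solutions using Pandas are also welcome.

Edit 1: more details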

StartTime         --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03

But this falls between 5:00:00 --> 6:00:00, so the count of all logs within that time range is what I am trying to find. Likewise for the other ranges, such as:

5:00:00 --> 6:00:00 Hourly Count 
00:00:00 --> 6:00:00 Every 6 hours 
00:00:00 --> 12:00:00 Every 12 hours 

5 Jan 2004, Mon --> count 
6 Jan 2004, Tue --> Count

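And so on. I am looking for a generic program in which the time/hours range can be changed as needed.

Unfortunately I couldn't find any elegant solution.

Here is my attempt: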

import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID','StartTime','StopTime','GPS1','GPS2']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
# pd.datetime was removed from pandas; pd.Timestamp gives the same epoch origin
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0


for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .ix was removed from pandas; .loc does the same label-based assignment
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

# .dt.weekday_name was removed from pandas; .dt.day_name() replaces it
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

print(r[r.LogCount > 0])
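
The comment in the loop refers to the interval overlap test without spelling it out. As a sanity check, the sketch below compares the doubled-midpoint form used there with the plain endpoint comparison. The function names are made up for illustration, and small integers stand in for epoch seconds.

```python
def overlaps_endpoints(a0, a1, b0, b1):
    """Two intervals overlap when each one starts before the other ends."""
    return a0 < b1 and b0 < a1


def overlaps_midpoint(a0, a1, b0, b1):
    """Same test written with doubled midpoints (m) and doubled half-lengths (d)."""
    m_a, d_a = a0 + a1, a1 - a0
    m_b, d_b = b0 + b1, b1 - b0
    return abs(m_a - m_b) < d_a + d_b


# Brute-force comparison over every pair of small integer intervals.
for a0 in range(6):
    for a1 in range(a0, 6):
        for b0 in range(6):
            for b1 in range(b0, 6):
                args = (a0, a1, b0, b1)
                assert overlaps_endpoints(*args) == overlaps_midpoint(*args)
```

Because the comparison is strict, intervals that only touch at an endpoint are not treated as overlapping.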

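PS: the fewer periods there are in the reporting DF `r`, the faster the calculation. So if you know beforehand that certain time ranges contain no data (e.g. weekends or holidays), you may want to drop those rows (times).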
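
One possible way to find the empty periods automatically, rather than knowing them beforehand, is sketched below. It is only an illustration under a few assumptions: `nonempty_buckets` is a hypothetical helper, buckets are aligned to whole UTC hours as in `r`, and every log has `StopTime >= StartTime`. The demo rows are taken from the sample input.

```python
import numpy as np
import pandas as pd


def nonempty_buckets(df, width=3600):
    """Return the epoch start of every `width`-second bucket touched by a log."""
    first = df["StartTime"].to_numpy() // width   # bucket holding each start
    last = df["StopTime"].to_numpy() // width     # bucket holding each stop
    lo = int(first.min())
    # Difference array: +1 where a log enters a bucket, -1 just after it leaves.
    delta = np.zeros(int(last.max()) - lo + 2, dtype=np.int64)
    np.add.at(delta, first - lo, 1)
    np.add.at(delta, last - lo + 1, -1)
    covered = np.cumsum(delta)[:-1] > 0
    return (np.flatnonzero(covered) + lo) * width


# Two rows from the sample input; the second one runs into the next hour.
demo = pd.DataFrame(
    [["00022d9064bc", 1073260801, 1073260803],
     ["00022de73863", 1073260804, 1073265410]],
    columns=["UserID", "StartTime", "StopTime"],
)
print(nonempty_buckets(demo))
```

With something like this, `r` could be reduced before the loop, for example with `r = r[r['start'].isin(nonempty_buckets(df))]`. The helper may keep a bucket that a log only touches at its boundary, so it is a safe superset of the rows that end up with a non-zero count.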

Result:

                          start  LogCount  UniqueIDCount  Day StartTime   EndTime
2004-01-05 00:00:00  1073260800        24             15  Mon  00:00:00  01:00:00
2004-01-05 01:00:00  1073264400         5              5  Mon  01:00:00  02:00:00
2004-01-05 02:00:00  1073268000         3              3  Mon  02:00:00  03:00:00
2004-01-05 03:00:00  1073271600         3              3  Mon  03:00:00  04:00:00
2004-01-05 04:00:00  1073275200         2              2  Mon  04:00:00  05:00:00
2004-01-06 12:00:00  1073390400        22             12  Tue  12:00:00  13:00:00
2004-01-06 13:00:00  1073394000         3              2  Tue  13:00:00  14:00:00
2004-01-06 14:00:00  1073397600         3              2  Tue  14:00:00  15:00:00
2004-01-06 15:00:00  1073401200         3              2  Tue  15:00:00  16:00:00
2004-01-10 16:00:00  1073750400        20             11  Sat  16:00:00  17:00:00
2004-01-14 23:00:00  1074121200       218             69  Wed  23:00:00  00:00:00
2004-01-15 00:00:00  1074124800        12             11  Thu  00:00:00  01:00:00
2004-01-15 01:00:00  1074128400         1              1  Thu  01:00:00  02:00:00
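
Since the question also asks for 6-hour, 12-hour and daily windows and mentions rounding the times, here is a minimal sketch of a different technique: floor each StartTime to the start of its window and group on that. Note that this is not the overlap test used above. It counts each log once, in the window that contains its StartTime, so a long session is not counted again in later windows. The function name `bucket_stats` and its `hours` parameter are assumptions made for illustration, and windows are aligned to UTC.

```python
import pandas as pd


def bucket_stats(df, hours=1):
    """Count logs and distinct users per fixed-width window of `hours` hours."""
    width = int(hours * 3600)
    # Floor each StartTime (epoch seconds) to the start of its window.
    bucket = (df["StartTime"] // width) * width
    out = (
        df.groupby(bucket)["UserID"]
        .agg(LogCount="count", UniqueIDCount="nunique")
        .reset_index()
    )
    begin = pd.to_datetime(out["StartTime"], unit="s")
    end = pd.to_datetime(out["StartTime"] + width, unit="s")
    out["Day"] = begin.dt.day_name().str[:3]
    out["EndTime"] = end.dt.time
    out["StartTime"] = begin.dt.time
    return out[["StartTime", "EndTime", "Day", "LogCount", "UniqueIDCount"]]


# A few rows from the sample input.
cols = ["UserID", "StartTime", "StopTime", "GPS1", "GPS2"]
sample = pd.DataFrame(
    [["00022d9064bc", 1073260801, 1073260803, 819251, 440006],
     ["00022d9064bc", 1073260803, 1073260810, 819213, 439954],
     ["00904b4557d3", 1073260803, 1073261920, 817526, 439458],
     ["00022de73863", 1073260804, 1073265410, 817558, 439525]],
    columns=cols,
)
for h in (1, 6, 12, 24):
    print(bucket_stats(sample, hours=h))
```

Changing `hours` is all that is needed to switch between hourly, 6-hour, 12-hour and daily reports. Per-weekday totals could be derived from the daily report by grouping on the `Day` column.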