Resampling timestamps in a CSV

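I have a CSV file that stores data from several smartphone sensors. The timestamps are the number of nanoseconds elapsed since the recording program was started. A short example: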

timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER

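The time steps between the timestamps are not equal, but I would like them to be. How can I achieve this? I was thinking of simply resampling the nanoseconds to microseconds with the code below. This is my first attempt. It runs without errors, but it returns a CSV file without timestamps in which every row after the first is empty.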

series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000)
series.index = pandas.to_datetime(series.index, unit='ns')
downsampled = series.resample("U").mean()
downsampled.to_csv("file4.csv", index=False)

I would appreciate ways to improve my code as well as other ideas for achieving my overall goal.

I think this line is incorrect:

series.index = pandas.to_datetime(series.index, unit='ns')

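The timestamps column should be used instead of the index (this presumably assumes the file was read without index_col=0, so that timestamps is still a regular column):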

series.index = pandas.to_datetime(series.timestamps, unit='ns')

This is the result:

                             timestamps  acce_x  acce_y  acce_z  grav_x  grav_y  grav_z
timestamps
1970-01-01 00:00:00.025993 25993266.000  -2.529   6.918   4.340  -2.901   7.935   4.978
1970-01-01 00:00:00.025994          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025995          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025996          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025997          NaN     NaN     NaN     NaN     NaN     NaN     NaN
...                                 ...     ...     ...     ...     ...     ...     ...
1970-01-01 00:00:00.046130          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046131          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046132          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046133          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046134 46134867.000  -1.990   6.810   6.321  -2.419   7.394   5.971

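When you resample to microseconds, there are not enough values to fill consecutive buckets: the samples are roughly 2 to 3 milliseconds apart, so almost every 1-microsecond bucket is empty and you end up with NaN.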
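To make the empty-bucket effect concrete, here is a minimal sketch that counts how many 1-microsecond buckets actually receive a sample. It uses three rows of the sample data embedded as a string (the name CSV and the single acce_x column are simplifications for illustration, not part of the original code).

```python
import io
import pandas as pd

# Three rows from the sample data; timestamps are nanoseconds since start.
CSV = """timestamps,acce_x
25993266,-2.5290375
28129496,-2.5290375
31028666,-2.53741455
"""

df = pd.read_csv(io.StringIO(CSV), index_col="timestamps")
df.index = pd.to_datetime(df.index, unit="ns")

# One bucket per microsecond between the first and the last sample.
per_us = df.resample(pd.Timedelta(microseconds=1)).mean()

n_buckets = len(per_us)
n_filled = int(per_us["acce_x"].notna().sum())
print(f"{n_filled} of {n_buckets} buckets contain data")
```

Only the buckets that happen to contain one of the original samples hold a value; all the others are NaN, which is what produces the mostly empty CSV.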

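If you want equal time steps and also want every bucket to be filled, you can find the largest gap between consecutive timestamps and use it as the resampling interval: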

First, convert the index to a TimedeltaIndex, since it holds the time elapsed since the application was started.

import pandas as pd
# vectorized conversion of integer nanoseconds to a TimedeltaIndex
df.index = pd.to_timedelta(df.index, unit='ns')
df.index

# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496',
                '0 days 00:00:00.031028666', '0 days 00:00:00.033164897',
                '0 days 00:00:00.036064067', '0 days 00:00:00.038200297',
                '0 days 00:00:00.041099467', '0 days 00:00:00.043235697',
                '0 days 00:00:00.046134867'],
               dtype='timedelta64[ns]', name='timestamps', freq=None)

Next, resample:

import numpy as np

max_diff = np.diff(df.index).max()
# numpy.timedelta64(2899170,'ns')

# convert to pandas.Timedelta to use it with `resample`
dfr = df.resample(pd.Timedelta(max_diff)).mean()
dfr

Output:

                             acce_x    acce_y    acce_z    grav_x    grav_y    grav_z
timestamps                                                                           
0 days 00:00:00.025993266 -2.529037  6.918060  4.340012 -2.888277  7.903406  5.034998
0 days 00:00:00.028892436 -2.537415  6.931229  4.605766 -2.850200  7.807237  5.205172
0 days 00:00:00.031791606 -2.545792  6.944397  4.871521 -2.796879  7.735252  5.337435
0 days 00:00:00.034690776 -2.472771  6.912071  5.180374 -2.743558  7.663267  5.469699
0 days 00:00:00.037589946 -2.399750  6.879746  5.489227 -2.664888  7.592961  5.602406
0 days 00:00:00.040489116 -2.187862  6.843834  5.941738 -2.544471  7.490385  5.794085
0 days 00:00:00.043388286 -1.990341  6.810318  6.321220 -2.419225  7.393572  5.971000

To verify that the index is evenly spaced, note that it now has freq='2899170N':

dfr.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436',
                '0 days 00:00:00.031791606', '0 days 00:00:00.034690776',
                '0 days 00:00:00.037589946', '0 days 00:00:00.040489116',
                '0 days 00:00:00.043388286'],
               dtype='timedelta64[ns]', name='timestamps', freq='2899170N')

Or check it with diff:

np.diff(dfr.index)
# output:
array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170],
      dtype='timedelta64[ns]')
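Putting the steps above together, the following is a minimal end-to-end sketch, not a definitive implementation. It assumes the data is read with the timestamps column as the index, embeds a single-column excerpt of the sample data as a string (CSV, buf and csv_text are illustrative names), and writes the result with the index kept, since index=False in the question's code is what dropped the timestamps from the output file.

```python
import io
import numpy as np
import pandas as pd

# Single-column excerpt of the sample data (nanoseconds since start).
CSV = """timestamps,acce_x
25993266,-2.5290375
28129496,-2.5290375
31028666,-2.53741455
33164897,-2.5457916
36064067,-2.4727707
38200297,-2.3997498
41099467,-2.25849155
43235697,-2.1172333
46134867,-1.9903412
"""

df = pd.read_csv(io.StringIO(CSV), index_col="timestamps")
df.index = pd.to_timedelta(df.index, unit="ns")

# The largest gap between consecutive samples becomes the bucket width,
# so every bucket receives at least one sample.
max_diff = pd.Timedelta(np.diff(df.index).max())
dfr = df.resample(max_diff).mean()

# Convert the index back to integer nanoseconds and keep it in the CSV.
dfr.index = dfr.index.astype("int64")
buf = io.StringIO()
dfr.to_csv(buf, index_label="timestamps")
csv_text = buf.getvalue()
print(csv_text)
```

To write to disk instead, pass a file path to to_csv in place of buf. Note that buckets containing two samples are averaged, so the result has fewer rows than the input.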