Resampling timestamps in a CSV
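I have a CSV file that stores data from several smartphone sensors. The timestamps are the number of nanoseconds elapsed since the recording program started. A short example: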
timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER
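The time steps between timestamps are not equal, but I want them to be. How can I achieve this?
My idea was to simply resample from nanoseconds to microseconds with the code below. This is my first attempt. It runs without errors, but it returns a CSV file with no timestamps, in which every row after the first is empty.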
series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000)
series.index = pandas.to_datetime(series.index, unit='ns')
downsampled = series.resample("U").mean()
downsampled.to_csv("file4.csv", index=False)
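I would appreciate ways to improve my code as well as other ideas for achieving my overall goal.
I think this line is incorrect:
series.index = pandas.to_datetime(series.index, unit='ns')
It should use the timestamps column rather than the index:
series.index = pandas.to_datetime(series.timestamps, unit='ns')
This is the result: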
timestamps acce_x acce_y acce_z grav_x grav_y grav_z
timestamps
1970-01-01 00:00:00.025993 25993266.000 -2.529 6.918 4.340 -2.901 7.935 4.978
1970-01-01 00:00:00.025994 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.025995 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.025996 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.025997 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1970-01-01 00:00:00.046130 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.046131 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.046132 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.046133 NaN NaN NaN NaN NaN NaN NaN
1970-01-01 00:00:00.046134 46134867.000 -1.990 6.810 6.321 -2.419 7.394 5.971
When you resample at microsecond resolution, there are not enough values to fill consecutive buckets, so you end up with NaN.
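To make the empty-bucket effect concrete, here is a minimal sketch that resamples the first three rows of the sample data into 1-microsecond buckets and counts how many buckets actually receive a value. The names `csv_text`, `per_us`, `n_buckets` and `n_filled` are made up for this illustration, and only the `acce_x` column is kept to keep it short.

```python
import io
import pandas as pd

# First three rows of the sample data, acce_x only (illustration).
csv_text = """timestamps,acce_x
25993266,-2.5290375
28129496,-2.5290375
31028666,-2.53741455
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="timestamps")
# Timestamps are nanoseconds elapsed since the recorder started.
df.index = pd.to_timedelta(df.index, unit="ns")

# One bucket per microsecond, as in the original attempt.
per_us = df.resample(pd.Timedelta(microseconds=1)).mean()

n_buckets = len(per_us)                          # thousands of buckets
n_filled = int(per_us["acce_x"].notna().sum())   # only the sampled ones
print(n_buckets, n_filled)
```

The samples are roughly 2 to 3 milliseconds apart, so each one lands in its own bucket and every bucket in between stays NaN.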
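If you want equal time steps and also want every bucket to be filled, you can find the largest difference between consecutive timestamps and use it as the resampling period:
First, convert the index to Timedelta, since it represents the time elapsed since the application started.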
df.index = df.index.map(lambda t: pd.Timedelta(t, unit='ns'))
df.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496',
'0 days 00:00:00.031028666', '0 days 00:00:00.033164897',
'0 days 00:00:00.036064067', '0 days 00:00:00.038200297',
'0 days 00:00:00.041099467', '0 days 00:00:00.043235697',
'0 days 00:00:00.046134867'],
dtype='timedelta64[ns]', name='timestamps', freq=None)
Next, resample:
import numpy as np
max_diff = np.diff(df.index).max()
# numpy.timedelta64(2899170,'ns')
# convert to pandas.Timedelta to use it with `resample`
dfr = df.resample(pd.Timedelta(max_diff)).mean()
dfr
Output:
acce_x acce_y acce_z grav_x grav_y grav_z
timestamps
0 days 00:00:00.025993266 -2.529037 6.918060 4.340012 -2.888277 7.903406 5.034998
0 days 00:00:00.028892436 -2.537415 6.931229 4.605766 -2.850200 7.807237 5.205172
0 days 00:00:00.031791606 -2.545792 6.944397 4.871521 -2.796879 7.735252 5.337435
0 days 00:00:00.034690776 -2.472771 6.912071 5.180374 -2.743558 7.663267 5.469699
0 days 00:00:00.037589946 -2.399750 6.879746 5.489227 -2.664888 7.592961 5.602406
0 days 00:00:00.040489116 -2.187862 6.843834 5.941738 -2.544471 7.490385 5.794085
0 days 00:00:00.043388286 -1.990341 6.810318 6.321220 -2.419225 7.393572 5.971000
To verify that the index is evenly spaced, check that it has freq='2899170N':
dfr.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436',
'0 days 00:00:00.031791606', '0 days 00:00:00.034690776',
'0 days 00:00:00.037589946', '0 days 00:00:00.040489116',
'0 days 00:00:00.043388286'],
dtype='timedelta64[ns]', name='timestamps', freq='2899170N')
Or check with diff:
np.diff(dfr.index)
# output:
array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170],
dtype='timedelta64[ns]')
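Putting the steps together, here is a hedged end-to-end sketch that goes from the CSV text to a resampled CSV. It assumes the sample rows from the question, reads them from an in-memory string instead of `file3.csv`, and selects the numeric columns explicitly, because recent pandas versions may refuse to average the string `labels` column. It also leaves out `index=False` when writing, since that argument drops the index and would explain the missing timestamps in the first attempt. The names `csv_text`, `numeric`, `step` and `out` are made up for this illustration.

```python
import io
import pandas as pd

# Sample rows from the question (stand-in for file3.csv).
csv_text = """timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="timestamps")
# Nanoseconds since program start -> TimedeltaIndex.
df.index = pd.to_timedelta(df.index, unit="ns")

# Keep only numeric columns; the string `labels` column cannot be averaged.
numeric = df.select_dtypes("number")

# Largest gap between consecutive samples, used as the bucket width
# so that every bucket contains at least one sample.
step = numeric.index.to_series().diff().max()

dfr = numeric.resample(step).mean()

# Write the result; the index is kept so the timestamps survive.
out = io.StringIO()
dfr.to_csv(out)
print(out.getvalue())
```

To write to disk instead, passing a path such as `"file4.csv"` to `to_csv` should work the same way. Note that averaging per bucket merges samples that fall into the same bucket, so the output has fewer rows than the input.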