Convert partial H:M:S durations with missing parts, to Seconds; or Right-align non-NA data
TL;DR: I want to right-align this df, shifting the values to the right so that the NaNs end up on the left:
In [6]: series.str.split(':', expand=True)
Out[6]:
0 1 2
0 1 25.842 <NA>
1 <NA> <NA> <NA>
2 0 15.413 <NA>
3 54.154 <NA> <NA>
4 3 2 06.284
to get it as contiguous data with the rightmost column filled:
0 1 2
0 0 1 25.842 # 0 or NA
1 <NA> <NA> <NA> # this NA should remain
2 0 0 15.413
3 0 0 54.154
4 3 2 06.284
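What I'm actually trying to do:
I have a Pandas Series of durations/timedeltas, roughly in H:M:S format, but sometimes the 'H' or 'H:M' parts can be missing, so I can't just pass it to Timedelta or datetime. What I want to do is convert them to seconds, which I have done, but it seems a bit convoluted: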
In [1]: import pandas as pd
...:
...: series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
...: t = series.str.split(':') # not using `expand` helps for the next step
...: t
Out[1]:
0 [1, 25.842]
1 <NA>
2 [0, 15.413]
3 [54.154]
4 [3, 2, 06.284]
dtype: object
In [2]: # reverse it so seconds are first; and NA's are just empty
...: rows = [i[::-1] if i is not pd.NA else [] for i in t]
In [3]: smh = pd.DataFrame.from_records(rows).astype('float')
...: # left-aligned is okay since it's continuous Secs->Mins->Hrs
...: smh
Out[3]:
0 1 2
0 25.842 1.0 NaN
1 NaN NaN NaN
2 15.413 0.0 NaN
3 54.154 NaN NaN
4 6.284 2.0 3.0
If I don't do this fillna(0) step, the seconds conversion later produces NaN for every row that is missing a part.
In [4]: smh.iloc[:, 1:] = smh.iloc[:, 1:].fillna(0) # NaN's in first col = NaN from data; so leave
...: # convert to seconds
...: smh.iloc[:, 0] + smh.iloc[:, 1] * 60 + smh.iloc[:, 2] * 3600
Out[4]:
0 85.842
1 NaN
2 15.413
3 54.154
4 10926.284
dtype: float64
^ Expected end result.
(Alternatively, I could write a small Python-only function that splits on : and then converts based on how many values each list has.)
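For reference, the Python-only alternative mentioned above could look roughly like the sketch below. The helper name hms_to_seconds is hypothetical (it is not from the question), and the sketch assumes every non-missing entry is valid 'S', 'M:S' or 'H:M:S' text.

```python
import math
import pandas as pd

def hms_to_seconds(value):
    """Convert 'S', 'M:S' or 'H:M:S' text to seconds; missing input gives NaN."""
    if pd.isna(value):
        return math.nan
    total = 0.0
    # Walk the parts left to right: multiplying the running total by 60
    # promotes it one unit (hours -> minutes -> seconds) before adding the next part.
    for part in value.split(':'):
        total = total * 60 + float(part)
    return total

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
seconds = series.map(hms_to_seconds).astype('float')
print(seconds)
```

Because the running total is multiplied by 60 at each step, the function does not need to branch on how many parts are present.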
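Let's try right-aligning the dataframe with numpy. The basic idea is to sort the dataframe along axis=1 so that the NaN values appear before the non-NaN values, while keeping the order of the non-NaN values unchanged: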
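import numpy as np
# kind='stable' guarantees that the non-NaN values keep their original order
i = np.argsort(np.where(df.isna(), -1, 0), axis=1, kind='stable')
df[:] = np.take_along_axis(df.values, i, axis=1)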
0 1 2
0 NaN 1.0 25.842
1 NaN NaN NaN
2 NaN 0.0 15.413
3 NaN NaN 54.154
4 3.0 2.0 6.284
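To get the total seconds, you can multiply the right-aligned dataframe by [3600, 60, 1] and then sum along axis=1. Note that the all-NaN row comes out as 0.0 here; pass min_count=1 to sum if it should stay NaN: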
df.mul([3600, 60, 1]).sum(1)
0 85.842
1 0.000
2 15.413
3 54.154
4 10926.284
dtype: float64
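Putting the two steps together, a self-contained sketch might look like this. It assumes the split columns can be cast to float directly, and it adds two things that are not in the snippets above: kind='stable' on argsort and min_count=1 on sum, so that the all-NA row stays NaN as the question expects. The names order, aligned and seconds are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True).astype('float')

# Sort key per cell: -1 for missing, 0 for present. A stable sort moves the
# missing cells to the left and leaves the present cells in their original order.
order = np.argsort(np.where(df.isna(), -1, 0), axis=1, kind='stable')
aligned = pd.DataFrame(np.take_along_axis(df.to_numpy(), order, axis=1), index=df.index)

# min_count=1 makes a row with no valid values sum to NaN instead of 0.0.
seconds = aligned.mul([3600, 60, 1]).sum(axis=1, min_count=1)
print(seconds)
```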
You can solve the problem earlier by padding the series with '0:', as follows:
# setup
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
# create a padding of 0 series
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) and c > 0 else '' for c in counts], dtype='string')
# apply padding
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
print(t)
Output
0 1 2
0 0 1 25.842
1 <NA> <NA> <NA>
2 0 0 15.413
3 0 0 54.154
4 3 2 06.284
1. Using the approach of sorting the NAs, I've come up with this, making use of Pandas apply and Python sorted:
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True)
# key for sorted is `pd.notna`, so False(0) sorts before True(1)
df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
(Then multiply as needed.) But it's slow, see below.
2. By pre-padding with '0:', I can then create pd.Timedelta values directly and get their total_seconds:
res = ... # from answer
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
(But at ~10k rows, doing the expanded split followed by multiply + sum is faster.)
Performance notes, 10k rows of data:
The original code/attempt from my question, with row reversal (so maybe I'll stick with it):
%%timeit
t = series.str.split(':')
rows = [i[::-1] if i is not pd.NA else [] for i in t]
smh = pd.DataFrame.from_records(rows).astype('float')
smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)
# 14.3 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy argsort + take_along_axis:
%%timeit
df = series.str.split(':', expand=True)
i = np.argsort(np.where(df.isna(), -1, 0), 1)
df[:] = np.take_along_axis(df.values, i, axis=1)
df.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 30.1 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pre-padding:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
t.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 48.3 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pre-padding, timedeltas + total_seconds:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
# 183 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pandas apply + Python sorted (very slow):
%%timeit
df = series.str.split(':', expand=True)
df = df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
df.apply(pd.to_numeric).mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 1.4 s ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
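The timings above don't show how the 10k-row series was built. One way to reproduce a comparable setup outside IPython is sketched below. It assumes the five sample values are simply repeated 2,000 times, which may not match the data actually used, and the function name reversed_rows_seconds is hypothetical. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

sample = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
# Assumption: repeat the five sample values 2,000 times to reach 10,000 rows.
series = pd.concat([sample] * 2000, ignore_index=True)

def reversed_rows_seconds(s):
    """The row-reversal approach from the question, wrapped for timing."""
    rows = [i[::-1] if i is not pd.NA else [] for i in s.str.split(':')]
    smh = pd.DataFrame.from_records(rows).astype('float')
    return smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)

# timeit.timeit returns the total time for `number` calls, in seconds.
runs = 5
elapsed = timeit.timeit(lambda: reversed_rows_seconds(series), number=runs)
print(f"{elapsed / runs * 1000:.1f} ms per call over {runs} calls")
```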