Convert partial H:M:S durations with missing parts to seconds; or right-align non-NA data

TL;DR: I want to right-align this df, shifting the values right so the NaNs end up on the left:

In [6]: series.str.split(':', expand=True)
Out[6]:
        0       1       2
0       1  25.842    <NA>
1    <NA>    <NA>    <NA>
2       0  15.413    <NA>
3  54.154    <NA>    <NA>
4       3       2  06.284

To get it as contiguous data filling the right-most columns:

        0       1       2
0       0       1  25.842  # 0 or NA
1    <NA>    <NA>    <NA>  # this NA should remain
2       0       0  15.413
3       0       0  54.154
4       3       2  06.284

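What I'm actually trying to do:

I have a Pandas Series of durations/timedeltas, roughly in H:M:S format, but sometimes the 'H' or 'H:M' parts are missing, so I can't just pass it to Timedelta or datetime. What I want is to convert them to seconds, which I've done, but it seems a bit convoluted: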

In [1]: import pandas as pd
   ...:
   ...: series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
   ...: t = series.str.split(':')  # not using `expand` helps for the next step
   ...: t
Out[1]:
0       [1, 25.842]
1              <NA>
2       [0, 15.413]
3          [54.154]
4    [3, 2, 06.284]
dtype: object

In [2]: # reverse it so seconds are first; and NA's are just empty
   ...: rows = [i[::-1] if i is not pd.NA else [] for i in t]

In [3]: smh = pd.DataFrame.from_records(rows).astype('float')
   ...: # left-aligned is okay since it's continuous Secs->Mins->Hrs
   ...: smh
Out[3]:
        0    1    2
0  25.842  1.0  NaN
1     NaN  NaN  NaN
2  15.413  0.0  NaN
3  54.154  NaN  NaN
4   6.284  2.0  3.0

If I don't do this fillna(0) step, the missing minutes/hours produce NaN in the seconds conversion below.

In [4]: smh.iloc[:, 1:] = smh.iloc[:, 1:].fillna(0)  # NaN's in first col = NaN from data; so leave
   ...: # convert to seconds
   ...: smh.iloc[:, 0] + smh.iloc[:, 1] * 60 + smh.iloc[:, 2] * 3600
Out[4]:
0       85.842
1          NaN
2       15.413
3       54.154
4    10926.284
dtype: float64

^ Expected end result.

(Alternatively, I could write a small Python-only function that splits on : and then converts based on how many values each list has.)
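For reference, a minimal sketch of what such a Python-only function might look like. The name `hms_to_seconds` is hypothetical, and instead of branching on the number of parts it folds them left to right, multiplying the running total by 60 at each step, which gives the same result for 'S', 'M:S' and 'H:M:S' input:

```python
import math

def hms_to_seconds(value):
    """Convert 'S', 'M:S' or 'H:M:S' text to seconds; missing input gives NaN."""
    if not isinstance(value, str):
        return math.nan  # pd.NA, None and NaN all land here
    total = 0.0
    for part in value.split(':'):
        # each further part shifts the running total up by a factor of 60
        total = total * 60 + float(part)
    return total

examples = ['1:25.842', None, '0:15.413', '54.154', '3:2:06.284']
seconds = [hms_to_seconds(v) for v in examples]
```

On a Series this could be applied with a list comprehension, e.g. `[hms_to_seconds(v) for v in series]`; it was not timed against the approaches in this thread.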

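Let's try right-aligning the dataframe with numpy. The basic idea is to sort the dataframe along axis=1 so that the NaN values come before the non-NaN values, while keeping the order of the non-NaN values unchanged: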

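import numpy as np
# kind='stable' guarantees the non-NaN values keep their original order
i = np.argsort(np.where(df.isna(), -1, 0), axis=1, kind='stable')
df[:] = np.take_along_axis(df.values, i, axis=1)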


     0    1       2
0  NaN  1.0  25.842
1  NaN  NaN     NaN
2  NaN  0.0  15.413
3  NaN  NaN  54.154
4  3.0  2.0   6.284

To get the total seconds, you can multiply the right-aligned dataframe by [3600, 60, 1] and then sum along axis=1:

df.mul([3600, 60, 1]).sum(1)

0       85.842
1        0.000
2       15.413
3       54.154
4    10926.284
dtype: float64
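Note that row 1 comes out as 0.000 here rather than NaN, because `sum` skips NaN by default and an all-NaN row sums to zero. The timings further down in this thread pass `min_count=1` to keep that row as NaN; a small sketch of the difference, with the right-aligned frame typed in by hand:

```python
import numpy as np
import pandas as pd

# right-aligned frame from the step above; row 1 is entirely missing
df = pd.DataFrame([[np.nan, 1.0, 25.842],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 0.0, 15.413],
                   [np.nan, np.nan, 54.154],
                   [3.0, 2.0, 6.284]])

# min_count=1 requires at least one non-NaN value per row, otherwise the sum is NaN
total = df.mul([3600, 60, 1]).sum(axis=1, min_count=1)
print(total)
```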

You can solve the problem earlier by padding the series with '0:', like so:

# setup
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')

# create a padding of 0 series
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) and c > 0 else '' for c in counts], dtype='string')

# apply padding
res = pad.str.cat(series)

t = res.str.split(':', expand=True)
print(t)

Output

      0     1       2
0     0     1  25.842
1  <NA>  <NA>    <NA>
2     0     0  15.413
3     0     0  54.154
4     3     2  06.284

1. Using the idea of sorting the NA's, I've come up with this - making use of Pandas apply and Python sorted:

series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True)

# key for sorted is `pd.notna`, so False(0) sorts before True(1)
df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')

(Then multiply as needed.) But it's very slow, see below.

2. By pre-padding with '0:', I can then create pd.Timedelta's directly and get their total_seconds:

res = ...  # from answer

pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())

(But at ~10k rows, doing the expanded split and then multiply+sum is faster.)


Performance notes, with 10k rows of data:

Initial code/attempt from my question, with the row reversal (the fastest of these, so maybe I'll stick with it):

%%timeit
t = series.str.split(':')
rows = [i[::-1] if i is not pd.NA else [] for i in t]
smh = pd.DataFrame.from_records(rows).astype('float')
smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)

# 14.3 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy argsort + take_along_axis:

%%timeit
df = series.str.split(':', expand=True)
i = np.argsort(np.where(df.isna(), -1, 0), 1)
df[:] = np.take_along_axis(df.values, i, axis=1)
df.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)

# 30.1 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pre-padding:

%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
t.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)

# 48.3 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pre-padding, timedeltas + total_seconds:

%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)

pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())

# 183 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas apply + Python sorted (very slow):

%%timeit
df = series.str.split(':', expand=True)
df = df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
df.apply(pd.to_numeric).mul([3600, 60, 1]).sum(axis=1, min_count=1)

# 1.4 s ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)