Get range from sparse datetimeindex

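I have this kind of pandas DataFrame for each user in a large database.

Each row is a period spanning [start_date, end_date], but sometimes 2 consecutive rows are actually the same period: the end_date equals the start_date of the next row (underlined in red). Sometimes the periods even overlap by more than 1 date.

I would like to get the "real periods" by merging the rows that correspond to the same period.

What I have tried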

import pandas as pd

def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a daily date_range covering [min_start_date, max_start_date]
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12"))
    for row in range(df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        # -- Skip rows with a missing start or end date
        if pd.notnull(start_date) and pd.notnull(end_date):
            # -- One column per row: 1 on every day inside the period
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % row] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")

    return t_date

This produces a DataFrame where each column is a period (1 if the day is in the range, NaN otherwise):

t_date
Out[29]: 
            period_0  period_1  period_2  period_3  period_4  period_5  \
2005-01-01       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-02       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-03       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-04       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-05       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-06       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-07       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-08       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-09       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-10       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-11       NaN       NaN       NaN       NaN       NaN       NaN  

Then, if I sum over all the columns (periods), I get almost what I want:

full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]

Out[31]: 
2005-11-14    1.0
2005-11-15    1.0
2005-11-16    1.0
2005-11-17    1.0
2005-11-18    1.0
2005-11-19    1.0
2005-11-20    1.0
2005-11-21    1.0
2005-11-22    1.0
2005-11-23    1.0
2005-11-24    1.0
2005-11-25    1.0
2005-11-26    1.0
2005-11-27    1.0
2005-11-28    1.0
2005-11-29    1.0
2005-11-30    1.0
2006-01-16    1.0
2006-01-17    1.0
2006-01-18    1.0
2006-01-19    1.0
2006-01-20    1.0
2006-01-21    1.0
2006-01-22    1.0
2006-01-23    1.0
2006-01-24    1.0
2006-01-25    1.0
2006-01-26    1.0
2006-01-27    1.0
2006-01-28    1.0

2015-07-06    1.0
2015-07-07    1.0
2015-07-08    1.0
2015-07-09    1.0
2015-07-10    1.0
2015-07-11    1.0
2015-07-12    1.0
2015-07-13    1.0
2015-07-14    1.0
2015-07-15    1.0
2015-07-16    1.0
2015-07-17    1.0
2015-07-18    1.0
2015-07-19    1.0
2015-08-02    1.0
2015-08-03    1.0
2015-08-04    1.0
2015-08-05    1.0
2015-08-06    1.0
2015-08-07    1.0
2015-08-08    1.0
2015-08-09    1.0
2015-08-10    1.0
2015-08-11    1.0
2015-08-12    1.0
2015-08-13    1.0
2015-08-14    1.0
2015-08-15    1.0
2015-08-16    1.0
2015-08-17    1.0
dtype: float64

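But I can't find a way to split this sparse datetimeindex into its time ranges to finally get the output I want: the original DataFrame containing the "real" periods.

This is probably not the most efficient way to do it, so if you have alternatives, don't hesitate!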
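
One possible way to split such a sparse index back into ranges is to start a new group whenever the gap to the previous day is more than one day. This is only a sketch assuming daily data; `index_to_ranges` and the sample `days` are hypothetical names and values, not part of the code above.

```python
import pandas as pd

def index_to_ranges(days):
    """Collapse a sparse collection of days into contiguous [start_date, end_date] ranges."""
    s = pd.Series(pd.DatetimeIndex(days).unique().sort_values())
    # A new range starts wherever the gap to the previous day is more than one day
    range_id = (s.diff() > pd.Timedelta(days=1)).cumsum()
    out = s.groupby(range_id).agg(start_date="min", end_date="max")
    return out.reset_index(drop=True)

# Hypothetical sparse index: two runs of consecutive days
days = pd.to_datetime(
    ["2005-11-14", "2005-11-15", "2005-11-16", "2006-01-16", "2006-01-17"]
)
ranges = index_to_ranges(days)
print(ranges)
```

With the objects above, this could be called as `index_to_ranges(full_spell[full_spell > 0].index)`. Filtering on `> 0` rather than `== 1` would keep the days covered by more than one overlapping row, since those sum to 2 or more.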

I found that using apply is more efficient:

def get_range(row):
    '''Return the list of days from "start_date" to "end_date" (inclusive).'''
    start_date = row["start_date"]
    end_date = row["end_date"]
    return list(pd.date_range(start_date, end_date, freq="1D"))

# -- Apply get_range() to each row of the initial df, then put one day per row
t_all = df.apply(get_range, axis=1).explode().rename("days_in_period").to_frame()
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)
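
An alternative that skips the day-by-day expansion entirely, sketched here under the assumption that each row has `start_date` and `end_date` columns as in the question: sort by `start_date`, track the running maximum of `end_date`, and open a new period only when a row starts after that maximum. The helper name `merge_periods` and the sample rows are made up for illustration.

```python
import pandas as pd

def merge_periods(df_user):
    """Merge rows whose [start_date, end_date] intervals touch or overlap."""
    d = df_user.dropna(subset=["start_date", "end_date"]).sort_values("start_date")
    # Latest end_date seen among all earlier rows
    prev_end = d["end_date"].cummax().shift()
    # A row opens a new period only if it starts after every earlier row has ended
    period_id = (d["start_date"] > prev_end).cumsum()
    merged = d.groupby(period_id).agg(
        start_date=("start_date", "min"),
        end_date=("end_date", "max"),
    )
    return merged.reset_index(drop=True)

# Hypothetical rows for one user: two touching rows and two overlapping rows
df_user = pd.DataFrame({
    "start_date": pd.to_datetime(["2005-11-14", "2005-11-20", "2006-01-16", "2006-01-10"]),
    "end_date": pd.to_datetime(["2005-11-20", "2005-11-30", "2006-01-28", "2006-01-18"]),
})
merged = merge_periods(df_user)
print(merged)
```

Rows whose end_date equals the next start_date are merged, which matches the definition of "same period" in the question. To also merge periods that are merely adjacent (the next one starts the day after the previous one ends), the comparison could use `prev_end + pd.Timedelta(days=1)` instead of `prev_end`.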