Iterating Through Timeseries One Day at a Time

Question

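Major edit:

OK, so I have a minute-level time series dataframe. For example, this dataframe holds one year of data. I am trying to build an analysis model that iterates over this data one day at a time.

The function will:
1) Slice one day of data out of the dataframe.
2) Create a 30-minute sub-slice of the daily slice (the first 30 minutes of the day).
3) Pass the data from both slices through the analysis part of the function.
4) Append the result to a new dataframe.
5) Continue iterating until done.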
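The steps above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the `prices` frame is synthetic, the "analysis" is reduced to taking the opening-range high and low, and the first 30 minutes are taken to be the 30 minutes after the first bar of each day.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level data: two sessions of 60 bars each, starting 14:30 UTC.
idx = pd.date_range("2015-01-06 14:30", periods=60, freq="min", tz="UTC").append(
    pd.date_range("2015-01-07 14:30", periods=60, freq="min", tz="UTC"))
prices = pd.DataFrame({"high": np.arange(120, dtype=float) + 1.0,
                       "low": np.arange(120, dtype=float)}, index=idx)

rows = {}
# Grouping on the normalized index yields one (day, slice) pair per day that has data.
for day, day_df in prices.groupby(prices.index.normalize()):
    start = day_df.index[0]
    # Label slicing is inclusive, so start .. start + 29 min covers 30 one-minute bars.
    opening = day_df.loc[start:start + pd.Timedelta(minutes=29)]
    # Placeholder analysis: opening-range high and low for this day.
    rows[day.date()] = {"high": opening["high"].max(), "low": opening["low"].min()}

# Collect the per-day results into a new dataframe indexed by date.
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

Collecting rows in a dict and building the frame once at the end avoids appending to a dataframe inside the loop.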

The dataframe is formatted as follows:

                           open_price    high    low  close_price    volume   price
2015-01-06 14:31:00+00:00     46.3800  46.440  46.29       46.380  560221.0  46.380
2015-01-06 14:32:00+00:00     46.3800  46.400  46.30       46.390   52959.0  46.390
2015-01-06 14:33:00+00:00     46.3900  46.495  46.36       46.470  100100.0  46.470
2015-01-06 14:34:00+00:00     46.4751  46.580  46.41       46.575   85615.0  46.575
2015-01-06 14:35:00+00:00     46.5800  46.610  46.53       46.537   78175.0  46.537

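It seems to me that the pandas DatetimeIndex functionality is the best way to accomplish this, but I don't know where to start.

(1) It seems I could use the .rollforward functionality, starting at the df start date/time and rolling forward one day on each iteration.

(2) Use df.loc[mask] to create the sub-slice.

I'm fairly sure I can figure it out after (2), but then again I'm not very familiar with time series analysis or the pandas DatetimeIndex functionality.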
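For idea (2), a sub-slice can be taken either with an explicit boolean mask or with `between_time`. The sketch below uses a made-up `df` with a single `price` column and assumes the session opens at 14:30 UTC, as in the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 120 one-minute bars starting at 14:30 UTC.
idx = pd.date_range("2015-01-06 14:30", periods=120, freq="min", tz="UTC")
df = pd.DataFrame({"price": np.linspace(46.0, 47.0, 120)}, index=idx)

# (a) Explicit boolean mask passed to df.loc: the first 30 minutes of one day.
day_start = pd.Timestamp("2015-01-06 14:30", tz="UTC")
mask = (df.index >= day_start) & (df.index < day_start + pd.Timedelta(minutes=30))
first_30 = df.loc[mask]

# (b) between_time selects by time of day (both ends inclusive by default),
# so on a multi-day frame it would return this window for every date at once.
same_window = df.between_time("14:30", "14:59")

print(len(first_30), len(same_window))
```

The mask form is the more general of the two; `between_time` is convenient when the window is the same clock time every day.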

Final dataframe:

              high     low   retrace  time
2015-01-06    46.440  46.29  True     47
2015-01-07    46.400  46.30  True     138
2015-01-08    46.495  46.36  False    NaN
2015-01-09    46.580  46.41  True     95
2015-01-10    46.610  46.53  False    NaN

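high = the high of the first 30 minutes of the day

low = the low of the first 30 minutes of the day

retrace = boolean: whether the price returned to the opening price at some point after the first 30 minutes

time = the time (in minutes) the retrace took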

Here is my code, which seems to work (thanks for the help, everyone!):

import numpy as np
import pandas as pd

# Days that actually have data in the sample window, as 'YYYY-MM-DD' strings.
sample = msft_prices.loc[s_date:e_date]
sample = sample.resample('D').mean().dropna()
sample = sample.index.strftime('%Y-%m-%d')

def hi_lo(prices):
    rows = []  # one [High, Low, Retraced, Time] row per processed day
    days = []  # index labels, kept in step with rows
    for i in sample:
        # Opening range: 14:30 to 15:00 of day i.
        ORTDF = prices.loc[i + ' 14:30':i + ' 15:00']
        if ORTDF.empty:
            continue
        ORH = ORTDF['high'].max()      # opening-range high (float)
        ORHK = ORTDF['high'].idxmax()  # timestamp of that high
        ORL = ORTDF['low'].min()       # opening-range low (float)
        ORLK = ORTDF['low'].idxmin()   # timestamp of that low

        # Full session for the same day.
        dailydf = prices.loc[i + ' 14:30':i + ' 21:00']
        touched, time_to_touch = 0, np.nan

        if ORHK < ORLK:
            # The high came first: was it exceeded later in the session?
            if dailydf['high'].max() > ORH:
                ORDHK = dailydf['high'].idxmax()
                touched = 1
                time_to_touch = (ORDHK - ORHK).total_seconds() / 60
        elif ORHK > ORLK:
            # The low came first: was it undercut later in the session?
            if dailydf['low'].min() < ORL:
                ORDLK = dailydf['low'].idxmin()
                touched = 1
                time_to_touch = (ORDLK - ORLK).total_seconds() / 60
        else:
            # High and low fall on the same bar; skip the day.
            continue

        rows.append([ORH, ORL, touched, time_to_touch])
        days.append(i)

    # Build the result once, indexed only by the days that produced a row.
    return pd.DataFrame(rows, columns=['High', 'Low', 'Retraced', 'Time'], index=days)

This may not be the most elegant approach, but hey, it works!

Answer 1

Read the docs as a general reference.

Setup (please provide this yourself in the question next time!):

import numpy as np
import pandas as pd

dates = pd.to_datetime(['19 November 2010 9:01', '19 November 2010 9:02', '19 November 2010 9:03',
                        '20 November 2010 9:05', '20 November 2010 9:06', '20 November 2010 9:07'])
df = pd.DataFrame({'low_price': [1.2, 1.8, 1.21, 2., 4., 1.201],
                   'high_price': [3., 1.8, 1.21, 4., 4.01, 1.201]}, index=dates)
df

                    high_price  low_price
2010-11-19 09:01:00     3.000   1.200
2010-11-19 09:02:00     1.800   1.800
2010-11-19 09:03:00     1.210   1.210
2010-11-20 09:05:00     4.000   2.000
2010-11-20 09:06:00     4.010   4.000
2010-11-20 09:07:00     1.201   1.201

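I would group by day and then apply a function to each day that computes whether there was a retrace and the period in which it happened. Your question was not clear about which column to operate on, or what tolerance counts as "prices are the same", so I made both of them options.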

def retrace_per_day(day, col='high_price', epsilon=0.5):
    """Take one day of data and return whether there was a retrace.
    If yes, return 1 and the timestamp of the minute in which it happened.
    Otherwise return 0 and np.nan."""
    # True wherever the price is within epsilon of the day's first price
    cond = (np.abs(day[col] - day[col].iloc[0]) < epsilon)
    cond_index = cond[cond].index
    # The first row always matches itself, so a retrace needs a second match
    if len(cond_index) > 1:
        retrace, period = 1, cond_index[1]
    else:
        retrace, period = 0, np.nan
    return pd.Series({'retrace': retrace, 'period': period})

df.groupby(pd.Grouper(freq='1D')).apply(retrace_per_day)

           period   retrace
2010-11-19  NaN     0.0
2010-11-20  2010-11-20 09:06:00     1.0

You can then use this to merge back into the original dataframe if needed.
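A minimal sketch of such a merge, assuming the per-day result is indexed by midnight timestamps (the `daily` frame below is a hand-written stand-in for the groupby output, not the real result):

```python
import pandas as pd

# Minute-level frame, a trimmed copy of the setup data.
dates = pd.to_datetime(["2010-11-19 09:01", "2010-11-19 09:02",
                        "2010-11-20 09:05", "2010-11-20 09:06"])
df = pd.DataFrame({"high_price": [3.0, 1.8, 4.0, 4.01]}, index=dates)

# Stand-in for the per-day result, one row per calendar day.
daily = pd.DataFrame({"retrace": [0, 1]},
                     index=pd.to_datetime(["2010-11-19", "2010-11-20"]))

# Map every minute row to its day (midnight timestamp) and look up the daily value.
merged = df.copy()
merged["retrace"] = daily["retrace"].reindex(df.index.normalize()).to_numpy()
print(merged)
```

Every minute row then carries the flag of the day it belongs to.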
