SWAB segmentation algorithm on time series data
I am trying to understand how to segment a set of time series data (daily stock prices, temperatures, etc.) and came across a book that explains how to perform the SWAB (Sliding Window And Bottom-up) segmentation algorithm, but I don't quite understand it. The segmentation is part of a sonification algorithm. The following text is from "Multimedia Data Mining and Analytics: Disruptive Innovation".
The SWAB segmentation
algorithm gets four parameters—the input file (time series data), the output
file (segmented data), the maximal error, and the indication of nominal attributes.
After running a number of experiments on time series of different sizes with different
values for the number of segments, we chose the appropriate default number of
segments as follows. 25–50 % of time series size for time series with less than 100
observations, 20–35 % for time series with 100–200 observations, and 15–25 % for
time series with more than 200 observations. If the user is not interested to use the
default value for any reason, he can enter his own number of segments as a parameter
to the algorithm.
Starting with the default values for the minimum and the maximum error, we
run the segmentation algorithm for the first time and get the minimum number of
segments for a given time series (the higher the maximum error, the fewer segments
will be found). Then we decrease the maximum error (and so increase the number
of found segments) trying to narrow the upper and the lower bounds of error by
dividing the base by powers of 2 (like in binary search). Every time after running
the segmentation algorithm with the current maximal error, we test whether this
value gives a better approximation for the optimal number of segments, and so is a
better upper or lower bound for the optimal maximum error. If so, we advance the
appropriate bound to this value. In the beginning, only the upper bound is affected.
However, once we found the lower bound that provides more segments than the
optimum, we continue to look for the optimal number of segments by smaller steps:
the next maximum error is the mean between the current upper and lower bounds.
As follows from our experience with many different time series databases, the
optimal maximal error is usually found within 3–4 iterations. The convergence rate
depends on the input time series database itself. If the algorithm has not converged
within 20 iterations, we stop searching and proceed with the next sonification steps
using the segments found at the 20th iteration.
For example, if I have time series data with 150 observations (which falls in the 100-200 range, so the default number of segments would be 20-35% of 150, i.e. roughly 30 to 52 segments), what concrete steps do I need to take to segment the data?
Any help is appreciated, thanks.
Exact steps
Here is a brief description of the method:
The Sliding Window algorithm works by anchoring the left point of a
potential segment at the first data point of a time series,
then attempting to approximate the data to the right with
increasingly longer segments. At some point i, the error for the
potential segment is greater than the user-specified threshold, so
the subsequence from the anchor to i -1 is transformed into
a segment. The anchor is moved to location i, and the process repeats
until the entire time series has been transformed into a piecewise
linear approximation.
Based on this, pseudocode for the algorithm follows. See my comments in the code for the specifics.
// function takes a set of points T and a max error
function Sliding_Window(T, max_error) {
    anchor = 1;
    while (not finished segmenting time series) {
        i = 2;
        // keep making subsets of progressively larger size
        // until the error of the subset is greater than the max error
        // T[anchor : anchor + i] represents the elements of the set
        // from index (anchor) to index (anchor + i)
        // this could be implemented as an array slice
        while (calculate_error(T[anchor : anchor + i]) < max_error) {
            i = i + 1;
        }
        // add the newly found segment to the set of all segments
        Seg_TS = concat(Seg_TS, create_segment(T[anchor : anchor + (i - 1)]));
        // and advance the anchor in preparation for creating a new segment
        anchor = anchor + i;
    }
}
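If it helps, here is a small runnable sketch of that pseudocode in Python. It is only my interpretation: the sum-of-squares error function, the 0-based indexing, and the least-squares line fit via numpy.polyfit are assumptions, not something the book prescribes.

import numpy as np

# residual sum of squares of the best-fit line through a run of points;
# x positions are assumed to be 0, 1, 2, ... within the candidate segment
def sum_of_squares_error(y):
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

def sliding_window(T, max_error):
    segments = []
    anchor = 0                                # Python is 0-indexed
    while anchor < len(T) - 1:
        i = 2
        # grow the candidate segment until its error exceeds the budget
        # or we run out of data
        while anchor + i <= len(T) and sum_of_squares_error(T[anchor:anchor + i]) < max_error:
            i += 1
        # keep the last candidate that was still within budget
        segments.append(T[anchor:anchor + i - 1])
        anchor = anchor + i - 1
    return segments

# example: two clean linear runs separated by a jump
series = [0, 1, 2, 3, 10, 11, 12, 13]
print(sliding_window(series, max_error=1.0))  # -> [[0, 1, 2, 3], [10, 11, 12, 13]]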
"Error"
的定义
您似乎不清楚的一件事是 "error" 在这种情况下的含义。下面的段落很好地解释了它:
All segmentation algorithms also need some method to evaluate the
quality of fit for a potential segment. A measure commonly used in
conjunction with linear regression is the sum of squares, or the
residual error. This is calculated by taking all the vertical
differences between the best-fit line and the actual data points,
squaring them and then summing them together. Another commonly
used measure of goodness of fit is the distance between the best fit
line and the data point furthest away in the vertical direction.
In other words, "error" can be expressed in more than one way here. Two measures commonly used in statistics are the sum of squares and the maximum vertical distance. In theory you could even write your own function for this, as long as it returns a number that somehow indicates how well the segment represents the given set of points.
More information on the sum-of-squares method can be found here: https://en.wikipedia.org/wiki/Residual_sum_of_squares
If you want to implement it yourself, some pseudocode might look like this:
function calculateSegmentErrorUsingSumOfSquares() {
    sum = 0;
    for each (point in set_approximated_by_segment) {
        // vertical difference between the actual point and the fitted segment
        difference = point.y_coordinate - approximation_segment.y_at_x(point.x_coordinate);
        sum = sum + (difference * difference);
    }
    return sum;
}
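And if you prefer the other measure mentioned in the passage above (the maximum vertical distance), here is a sketch in the same style, reusing the assumed least-squares line fit:

import numpy as np

# largest vertical gap between the best-fit line and any point in the segment
def max_vertical_distance_error(y):
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)   # same assumed least-squares fit
    return float(np.max(np.abs(y - (slope * x + intercept))))

Either function can be dropped into the sliding_window sketch above in place of sum_of_squares_error.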
Note that whatever method you use will likely have certain advantages and disadvantages. See Jason's comment below for more information and references, but the key point is: make sure that whatever error function you choose responds well to the kind of data you expect.
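Finally, to tie this back to the tuning procedure in the book excerpt you quoted: once you have a segmenter, the book searches for a maximal error whose segment count lands in the default range (for your 150 observations, 20-35% of 150, i.e. roughly 30 to 52 segments), stopping after 20 iterations. The sketch below is a loose reading of that search as a plain bisection over max_error, using the sliding_window sketch above as a stand-in for the full SWAB segmenter; the initial_error starting value is an assumption you would pick for your data.

# bisection-style search for a max_error whose segment count falls
# inside [target_low, target_high]; a loose reading of the book's
# "dividing ... by powers of 2" tuning, not its exact procedure
def tune_max_error(T, target_low, target_high, initial_error, max_iterations=20):
    lower, upper = 0.0, initial_error          # current bounds on max_error
    error = initial_error
    segments = sliding_window(T, error)
    for _ in range(max_iterations):
        n = len(segments)
        if target_low <= n <= target_high:
            break                              # segment count hit the target range
        if n < target_low:
            upper = error                      # too few segments: max_error too large
        else:
            lower = error                      # too many segments: max_error too small
        error = (lower + upper) / 2.0          # next guess: midpoint of the bounds
        segments = sliding_window(T, error)
    return error, segments

# for a 150-observation series: target 20-35% of 150 = 30 to 52 segments
# best_error, segs = tune_max_error(my_series, 30, 52, initial_error=100.0)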