我们可以在 Python 中聚类多元时间序列数据集吗

Can we cluster Multivariate Time Series dataset in Python

我有一个数据集,其中包含不同股票在不同 times.For 示例中的许多财务信号值

StockName  Date   Signal1  Signal2
----------------------------------
Stock1     1/1/20    a       b
Stock1     1/2/20    c       d
.
.
.
Stock2     1/1/20    e       f
Stock2     1/2/20    g       h
.
.
.

我想建立一个时间序列 table 如下所示,并基于 signal1 和 signal2(2 个变量)对股票进行聚类

StockName   1/1/20    1/2/20    ........    Cluster#
----------------------------------------------------
 Stock1     [a,b]      [c,d]                    0
 Stock2     [e,f]      [g,h]                    1
 Stock3     ......     .....                    0
 .
 .
 .

1)有什么方法可以做到吗? (基于时间序列数据的多个变量的聚类股票)。我试图在网上搜索但他们都是关于基于一个变量的聚类时间序列。

2)另外,有什么方法可以将不同时间的不同股票也聚集在一起吗? (所以也许 time1 的 Stock1 与 time3 的 Stock2 在同一个集群中)

我正在根据您上次发布的新信息修改我的答案。

from utils import *

import time
import numpy as np

from mxnet import nd, autograd, gluon
from mxnet.gluon import nn, rnn
import mxnet as mx
import datetime
import seaborn as sns
import matplotlib.pyplot as plt

# %matplotlib inline
from sklearn.decomposition import PCA

import math

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

import xgboost as xgb
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

context = mx.cpu(); model_ctx=mx.cpu()
mx.random.seed(1719)

# Note: The purpose of this section (3. The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).

def parser(x):
    return datetime.datetime.strptime(x,'%Y-%m-%d')

# dataset_ex_df = pd.read_csv('data/panel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)


import yfinance as yf

# Get the data for the stock AAPL
start = '2018-01-01'
end = '2020-04-22'

data = yf.download('GS', start, end)

data = data.reset_index()
data

    data.dtypes

    # re-name field from 'Adj Close' to 'Adj_Close'
    data = data.rename(columns={"Adj Close": "Adj_Close"})
    data

num_training_days = int(data.shape[0]*.7)
print('Number of training days: {}. Number of test days: {}.'.format(num_training_days, data.shape[0]-num_training_days))



# TECHNICAL INDICATORS
#def get_technical_indicators(dataset):
# Create 7 and 21 days Moving Average
data['ma7'] = data['Adj_Close'].rolling(window=7).mean()
data['ma21'] = data['Adj_Close'].rolling(window=21).mean()


# Create exponential weighted moving average
data['26ema'] = data['Adj_Close'].ewm(span=26).mean()
data['12ema'] = data['Adj_Close'].ewm(span=12).mean()
data['MACD'] = (data['12ema']-data['26ema'])

# Create Bollinger Bands
data['20sd'] = data['Adj_Close'].rolling(window=20).std() 
data['upper_band'] = data['ma21'] + (data['20sd']*2)
data['lower_band'] = data['ma21'] - (data['20sd']*2)

# Create Exponential moving average
data['ema'] = data['Adj_Close'].ewm(com=0.5).mean()

# Create Momentum
data['momentum'] = data['Adj_Close']-1



dataset_TI_df = data
dataset = data


def plot_technical_indicators(dataset, last_days):
    plt.figure(figsize=(16, 10), dpi=100)
    shape_0 = dataset.shape[0]
    xmacd_ = shape_0-last_days

    dataset = dataset.iloc[-last_days:, :]
    x_ = range(3, dataset.shape[0])
    x_ =list(dataset.index)

    # Plot first subplot
    plt.subplot(2, 1, 1)
    plt.plot(dataset['ma7'],label='MA 7', color='g',linestyle='--')
    plt.plot(dataset['Adj_Close'],label='Closing Price', color='b')
    plt.plot(dataset['ma21'],label='MA 21', color='r',linestyle='--')
    plt.plot(dataset['upper_band'],label='Upper Band', color='c')
    plt.plot(dataset['lower_band'],label='Lower Band', color='c')
    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)
    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))
    plt.ylabel('USD')
    plt.legend()

    # Plot second subplot
    plt.subplot(2, 1, 2)
    plt.title('MACD')
    plt.plot(dataset['MACD'],label='MACD', linestyle='-.')
    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')
    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')
    # plt.plot(dataset['log_momentum'],label='Momentum', color='b',linestyle='-')

    plt.legend()
    plt.show()

plot_technical_indicators(dataset_TI_df, 400)

这将为您提供一些工作信号。当然,这些功能可以是您想要的任何东西。我相信您知道这是技术分析,而不是基本面分析。现在,您可以进行聚类,以及您想要的任何其他操作。

这是一个很好的 link 集群。

https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/

好material阅读(标题:时间序列聚类和降维)

https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3