Using simple models on Google Trends data to predict something doesn't work as expected

I am building a simple model on Google Trends data to forecast the future trend of a set of search terms. I took my inspiration from this blog post and tried to do essentially the same thing with other search terms, in order to find the best model for this kind of task.


The problem is that the forecasts for other search terms are completely wrong. I only use terms with regular patterns, sometimes less regular than the one in the blog example. Here is my adapted code:

import numpy as np
import pandas as pd
from datetime import date
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import InputLayer, Reshape, Conv1D, MaxPool1D, Flatten, Dense, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()



def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    X, y = [], []
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    for _ in range(len(target)):
        if end_y < len(target):
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    return np.array(X), np.array(y)


def fit_model(type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))

    if type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))

    if type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())

    if type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))

    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))

    # Compile
    model.compile(optimizer='adam', loss='mse')

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]

    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs, callbacks=callbacks, verbose=2)

    # Load best model
    model.load_weights('model.h5')

    # Return
    return model


# Define windows
window_X = 12
window_y = 6

# Load data
data = pd.read_csv('data/holocaust-world.csv', sep=',')
data = data.set_index(keys=pd.to_datetime(data['month']), drop=True).drop('month', axis=1)

# Scale data
data['y'] = data['y'] / 100.

# Prepare tensors
X, y = prepare_data(target=data['y'].values, window_X=window_X, window_y=window_y)

# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

# Train models
models = ['mlp', 'cnn', 'lstm']

# Test data
X_test = data['y'].values[-window_X:].reshape(1, -1)

# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*6, 'cnn': [np.nan]*6, 'lstm': [np.nan]*6})
preds = preds.set_index(pd.date_range(start=date(2018, 11, 1), end=date(2019, 4, 1), freq='MS'))

# Fit models and plot
for mod in models:

    # Train models
    model = fit_model(type=mod, X_train=X_train, y_train=y_train, X_test=X_valid, y_test=y_valid, batch_size=16, epochs=1000)

    # Predict
    p = model.predict(x=X_test)

    # Fill
    preds[mod] = p[0]

# Plot the entire timeline, including the predicted segment
idx = pd.date_range(start=date(2004, 1, 1), end=date(2019, 4, 1), freq='MS')
data = data.reindex(idx)
plt.plot(data['y'], label='Google')

# Plot
plt.plot(preds['mlp'], label='MLP')
plt.plot(preds['cnn'], label='CNN')
plt.plot(preds['lstm'], label='LSTM')
plt.legend()
plt.show()

Here I tried to estimate interest in the Holocaust topic, which is also periodic (it peaks in January, as is clear from the CSV you can download from the Google Trends website). These are the results:


So the question is: why are the forecasts so far off, and what am I doing wrong here?

Thanks in advance!

Increase the number of predictions you test, and you should get better results:

window_y = 49
...
# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*49, 'cnn': [np.nan]*49, 'lstm': [np.nan]*49})
preds = preds.set_index(pd.date_range(start=date(2015, 1, 1), end=date(2019, 1, 1), freq='MS'))

Using a training/test split also helps:

# Training and test
train = 50
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

However, this particular trend is periodic but also affected by other factors. Prophet can help you deal with this kind of trend better than simple machine-learning models.
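
For reference, a minimal sketch of what that could look like with Prophet, assuming the same monthly CSV as in the question (data/holocaust-world.csv with a month column and a y column); depending on the installed version the package is imported as prophet or the older fbprophet:

import pandas as pd
from matplotlib import pyplot as plt
from prophet import Prophet  # on older installs: from fbprophet import Prophet

# Prophet expects a DataFrame with a datetime column 'ds' and a value column 'y'
df = pd.read_csv('data/holocaust-world.csv', sep=',')
df = df.rename(columns={'month': 'ds'})
df['ds'] = pd.to_datetime(df['ds'])

# Monthly data: keep yearly seasonality (the January peak), drop the finer components
m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
m.fit(df)

# Forecast the next 6 months, matching window_y in the original code
future = m.make_future_dataframe(periods=6, freq='MS')
forecast = m.predict(future)

m.plot(forecast)
plt.show()

Prophet models the trend and the yearly seasonal component explicitly, so it does not have to learn the January peak from a handful of sliding windows the way the networks above do.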