Machine learning for alternate time periods

Question

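I have a polynomial regression script that correctly predicts values on the X and Y axes. In my example I use CPU consumption. Below is a sample of the dataset: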

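Here time represents the collection time, for example:

1 = 1 minute
2 = 2 minutes

and so on...

consume is the CPU usage value for that minute. In summary, this dataset shows the behavior of a host over a 30-minute period, with each value corresponding to one minute in ascending order (1 minute, 2 minutes, 3 minutes, ...).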

The result is:

Using this algorithm:

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression  # needed for LinearRegression() below
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)

# Visualizing the Polynomial Regression results
def viz_polynomial():
    plt.scatter(X, y, color='red')
    # poly_reg is already fitted above, so transform() is enough here
    plt.plot(X, pol_reg.predict(poly_reg.transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
viz_polynomial()

# Predict the consumption at time = 20
print(pol_reg.predict(poly_reg.transform([[20]])))

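What is the problem?

If we duplicate this dataset so that the 30-minute range appears twice, the algorithm cannot make sense of the dataset and the result is poor. Dataset example: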

--> up to time = 30 --> up to time = 30

Complete data set

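Note: the dataset has 60 values, where each block of 30 values represents a 30-minute range, as if the blocks were collected on different days.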
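As a rough sketch of the layout described in the note, the time column alone can be reproduced as follows. Only the time values are built here; the consume values are not reproduced, and the variable names `time`, `steps`, and `restart` are chosen purely for illustration.

```python
import numpy as np

# Time column only: minutes 1..30, recorded for two collection periods
time = np.tile(np.arange(1, 31), 2)

# Difference between consecutive samples; a negative step marks the
# position where the second 30-minute period starts over at minute 1
steps = np.diff(time)
restart = np.flatnonzero(steps < 0)

print(time.size)  # 60
print(restart)    # [29]
```

The single negative step shows that the time column is not sorted, which matters for anything that treats the rows as one continuous sequence.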

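The displayed result is this:

Objective: I would like the blue line representing the polynomial regression to look like the one in the first result image. In the plot above, the line loops back on itself where the points are joined, as if the algorithm had failed.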

Research source

Answer 1

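The problem is that in the second case you are plotting with X = 1, 2, ... 30, 1, 2, ... 30. The plot function connects consecutive points, so the line jumps from time = 30 back to time = 1. If you just draw a scatter plot with pyplot, you will see a nice regression curve. Alternatively, you can use argsort. Here is the code, with the scatter plot in green and the argsort line in black.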

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)

# Visualizing the Polynomial Regression results
def viz_polynomial():
    plt.scatter(X, y, color='red')
    # Order of the rows when sorted by time, so the line is drawn left to right
    indices = np.argsort(X[:, 0])
    # poly_reg is already fitted above, so transform() is enough here
    y_pred = pol_reg.predict(poly_reg.transform(X))
    plt.scatter(X, y_pred, color='green')
    plt.plot(X[indices], y_pred[indices], color='black')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
viz_polynomial()

# Predict the consumption at time = 20
print(pol_reg.predict(poly_reg.transform([[20]])))

This is the output image for the larger dataset.
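A possible variation on the argsort approach is to predict on a sorted grid of the unique time values and plot that grid instead of the raw rows. The sketch below is only an illustration: it does not read data.csv, the consume values are synthetic stand-ins generated with a fixed seed, and the names `X_grid` and `y_grid` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for data.csv (an assumption, not the real data):
# two 30-minute periods with made-up consumption values
rng = np.random.default_rng(0)
time = np.tile(np.arange(1, 31), 2)
consume = 50 + 0.05 * (time - 15) ** 2 + rng.normal(0, 1, size=time.size)

X = time.reshape(-1, 1).astype(float)
y = consume

# Same model as above: degree-4 polynomial features + linear regression
poly_reg = PolynomialFeatures(degree=4)
pol_reg = LinearRegression()
pol_reg.fit(poly_reg.fit_transform(X), y)

# np.unique returns the distinct time values in ascending order,
# so a line drawn through this grid never doubles back
X_grid = np.unique(X).reshape(-1, 1)
y_grid = pol_reg.predict(poly_reg.transform(X_grid))

print(X_grid.shape, y_grid.shape)
```

Passing `X_grid` and `y_grid` to `plt.plot` should then give one continuous curve, with one predicted point per distinct minute rather than one per row.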


python

regression

machine-learning

polynomials

sklearn-pandas