How can I update a trained IsolationForest model with new datasets/dataframes in Python?
Suppose I fit the IsolationForest() algorithm from scikit-learn on time-series based Dataset1 or dataframe1 df1 and save the model using the methods mentioned here. Now I want to update my model for the new dataset2 or df2.
My findings:
- This workaround about incremental learning from the sklearn docs:

...learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time, there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve tuning.

Unfortunately, the IF algorithm does not support estimator.partial_fit(newdf) (the quick check after this list confirms it).
- The refit() offered by auto-sklearn is also not suitable for my case, based on this post.
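A minimal check (an illustration, not taken from the linked docs) that IsolationForest exposes no partial_fit, unlike estimators built for incremental learning such as SGDClassifier:

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import SGDClassifier

# IsolationForest has no incremental-learning API...
print(hasattr(IsolationForest(), "partial_fit"))  # False
# ...while online-learning estimators expose partial_fit
print(hasattr(SGDClassifier(), "partial_fit"))    # True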
How can I update the IF model that was trained and saved on dataset1 with the new dataset2?
You can simply reuse the .fit() call available to the estimator on the new data.
This would be preferred, especially in time series, because the signal changes over time and you do not want older, non-representative data to be understood as potentially normal (or anomalous).
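If you retrain on a schedule, one common pattern is to keep only the most recent window of samples when refitting. A minimal sketch, assuming a hypothetical window_size knob (it is not a sklearn parameter):

import numpy as np
from sklearn.ensemble import IsolationForest

window_size = 500                                # illustrative tuning knob
history = np.random.randint(1, 100, (2000, 10))  # stand-in for your signal history

model = IsolationForest()
model.fit(history[-window_size:])                # train on the newest rows only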
If the older data is important, you can simply join the older training data with the newer input signal data and then call .fit() again.
Also, as a side note: according to the sklearn documentation, it is better to use joblib than pickle for persisting models.
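For comparison, both of the following persist the same fitted estimator (file names are illustrative); joblib is simply more efficient with the large numpy arrays inside fitted sklearn models:

import pickle
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

model = IsolationForest().fit(np.random.rand(10, 3))

joblib.dump(model, "isf_model.joblib")   # recommended by the sklearn docs
with open("isf_model.pkl", "wb") as f:   # plain pickle also works
    pickle.dump(model, f)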
An MRE is below:
# Model
from sklearn.ensemble import IsolationForest
# Saving file
import joblib
# Data
import numpy as np

# Create a new model
model = IsolationForest()

# Generate some old data
df1 = np.random.randint(1, 100, (100, 10))

# Train the model
model.fit(df1)

# Save it off
joblib.dump(model, 'isf_model.joblib')

# Load the model
model = joblib.load('isf_model.joblib')

# Generate new data
df2 = np.random.randint(1, 500, (1000, 10))

# If the original data is now not important, I can just call .fit() again.
# If you are using time-series based data, this is preferred, as older data
# may not be representative of the current state.
model.fit(df2)

# If the original data is important, I can simply join the old data to the
# new data. There are multiple options for this:
# Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
# Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
combined_data = np.concatenate((df1, df2))
model.fit(combined_data)
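As a follow-up to the MRE (reusing the model and df2 defined above), you can then score incoming rows: in sklearn, predict() returns -1 for anomalies and 1 for inliers, and decision_function() returns a continuous score where lower means more anomalous.

# Score the new data with the refitted model
labels = model.predict(df2)             # -1 = anomaly, 1 = inlier
scores = model.decision_function(df2)   # lower = more anomalous
print((labels == -1).sum(), "rows flagged as anomalies")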