I can't save (serialize) a zip file with scikit-learn and MLeap in Python
Here is what I tried:
#Generate data
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df["y"] = (df['a'] > 0.5).astype(int)
df.head()
from mleap.sklearn.ensemble.forest import RandomForestClassifier
forestModel = RandomForestClassifier()
forestModel.mlinit(input_features='a',
                   feature_names='a',
                   prediction_column='e_binary')
forestModel.fit(df[['a']], df[['y']])
forestModel.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip", "randomforest.zip")
I get this error:
No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip/randomforest.zip.node'
I also tried: forestModel.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip/randomforest.zip")
and got an error message saying the "model_name" attribute is missing.
Can you help me?
Here is everything I tried and the results I got:
Zipping the pipeline:
1.
pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest")
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/model.json'
2.
pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'
3.
pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)
and created '/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest' beforehand
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'
4.
pipeline.serialize_to_bundle("/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)
=> FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'
5.
pipeline.serialize_to_bundle("/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)
=> OSError: [Errno 95] Operation not supported - but it saves something
6.
pipeline.serialize_to_bundle("jar:dbfs:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:dbfs:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'
7.
pipeline.serialize_to_bundle("jar:dbfs:/FileStore/tables/lifttruck_mleap/pipeline_zip2/1/model.zip", model_name="forest", init=True)
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:dbfs:/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'
8.
pipeline.serialize_to_bundle("dbfs:/FileStore/tables/lifttruck_mleap/pipeline_zip2/1/model.zip", model_name="forest", init=True)
=> FileNotFoundError: [Errno 2] No such file or directory: 'dbfs:/FileStore/tables/mleap/pipeline_zip2/1/model.zip/forest'
Zipping the model:
forest.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip", model_name="forest")
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip/forest.node'
forest.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1", model_name="model.zip")
=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip.node'
forest.serialize_to_bundle("/dbfs/FileStore/tables/mleap/random_forest_zip/1", model_name="model.zip")
=> Does not save a zip. Saves a bundle.
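Since a plain directory path produces a bundle directory rather than a zip, one option (my own workaround sketch, not something from the MLeap docs) is to zip that bundle directory yourself afterwards with the standard library:

```python
import os
import shutil
import tempfile

# Stand-in for the bundle directory that serialize_to_bundle writes out;
# a real run would point bundle_dir at the serialized bundle instead.
bundle_dir = tempfile.mkdtemp()
with open(os.path.join(bundle_dir, "model.json"), "w") as f:
    f.write("{}")

# Zip the bundle directory ourselves; make_archive appends ".zip" to the base name.
zip_path = shutil.make_archive(bundle_dir, "zip", bundle_dir)
print(zip_path)
```

The directory names here are illustrative; only the `make_archive` step carries over to a real bundle.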
I found the problem and a workaround.
Random writes are no longer possible with Databricks, as described here: https://docs.databricks.com/data/databricks-file-system.html?_ga=2.197884399.1151871582.1592826411-509486897.1589442523#local-file-apis
The workaround is to write the zip file to the local file system and then copy it into DBFS. So:
- serialize your model in a pipeline with "init=True", saving it to a local directory
- copy it to your data lake with "dbutils.fs.cp(source, destination)"
dbutils.fs.cp(source, destination)
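The write-locally-then-copy pattern can be sketched as follows. Since `pipeline.serialize_to_bundle` and `dbutils` only exist on a Databricks/MLeap setup, the sketch below uses `zipfile` and `shutil.copy` as stand-ins so it runs anywhere; all paths are illustrative:

```python
import os
import shutil
import tempfile
import zipfile

# Write the zip on the local filesystem first, where random writes are
# supported, then copy it to the target store. On Databricks the copy step
# would be dbutils.fs.cp(source, destination); shutil.copy stands in here.
local_dir = tempfile.mkdtemp()   # local scratch directory on the driver
dest_dir = tempfile.mkdtemp()    # stand-in for a /dbfs/FileStore/... target

local_zip = os.path.join(local_dir, "model.zip")
# Stand-in for pipeline.serialize_to_bundle(local_dir, model_name=..., init=True)
with zipfile.ZipFile(local_zip, "w") as zf:
    zf.writestr("model.json", "{}")

dest_zip = os.path.join(dest_dir, "model.zip")
shutil.copy(local_zip, dest_zip)
```

On a real cluster, the last line becomes `dbutils.fs.cp("file:" + local_zip, "dbfs:/FileStore/...")`.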