Can we append data to an existing csv file stored in Azure blob storage through Python?

Question

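I have a machine learning model deployed in Azure designer studio. I need to retrain it every day with new data through Python code. I need to keep the existing csv data in blob storage, add more data to that csv, and retrain on the result. If I retrain the model with only the new data, the old data is lost, so I need to retrain the model on the new data appended to the existing data. Is there any way to do this in Python?

I have also looked into append blobs, but they only add data at the end of the blob. The documentation mentions that we cannot update or add to an existing blob.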

Answer 1

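I'm not sure why it has to be a single csv file. There are many Python-based libraries for working with a dataset spread across multiple csv files.

In all of the examples, you pass a glob pattern that matches multiple files. This pattern works very naturally with an Azure ML Dataset, which you can use as your input. See this excerpt from the docs linked above: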

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()
    
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')] # here's the glob pattern

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
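The same idea can be tried locally before touching Azure. The following is a minimal sketch, assuming each day's data is written as its own csv file in one folder and that all files share the same columns; the helper names `load_csv_folder` and `demo`, the file names, and the `temp` column are hypothetical and only serve the illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_folder(pattern):
    """Read every csv file matching a glob pattern into one DataFrame."""
    # Sort the paths so rows come back in a stable, file-name order
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


def demo():
    """Write two hypothetical daily files, then load them with one pattern."""
    with tempfile.TemporaryDirectory() as folder:
        day_one = pd.DataFrame({"temp": [1.0, 2.0]})
        day_two = pd.DataFrame({"temp": [3.0]})
        day_one.to_csv(os.path.join(folder, "2019-01-01.csv"), index=False)
        day_two.to_csv(os.path.join(folder, "2019-01-02.csv"), index=False)
        return load_csv_folder(os.path.join(folder, "*.csv"))
```

With this layout, "appending" a day of data is just writing one more file into the folder; nothing that already exists has to be modified.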

Assuming all of the csv files fit into memory, you can easily turn these datasets into pandas DataFrames. With Azure ML Datasets, you call:

# name of a registered dataset (placeholder)
dataset_name = 'your dataset name'

# get the input dataset by name
dataset = Dataset.get_by_name(workspace, name=dataset_name)
# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()

With a Dask DataFrame, this GitHub issue says you can call:

df = my_dask_df.compute()

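As for the output dataset, you can control it by reading the output csv in as a DataFrame, appending data to it, and then overwriting it at the same location.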
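The read, append, overwrite cycle could look roughly like the sketch below. It assumes the `azure-storage-blob` (v12) package and a storage connection string; the function names `append_rows` and `append_to_blob_csv` and their parameters are hypothetical, and the new rows are assumed to have the same columns as the existing file.

```python
import io

import pandas as pd


def append_rows(existing_csv: bytes, new_rows: pd.DataFrame) -> bytes:
    """Return csv bytes holding the old rows followed by the new rows."""
    old = pd.read_csv(io.BytesIO(existing_csv))
    combined = pd.concat([old, new_rows], ignore_index=True)
    return combined.to_csv(index=False).encode("utf-8")


def append_to_blob_csv(conn_str, container, blob_name, new_rows):
    """Download the blob, append rows, and overwrite it at the same location."""
    # Imported lazily so append_rows can be used without the Azure SDK installed
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        conn_str, container_name=container, blob_name=blob_name
    )
    existing = blob.download_blob().readall()
    # overwrite=True replaces the blob content instead of raising an error
    blob.upload_blob(append_rows(existing, new_rows), overwrite=True)
```

This download and re-upload is not atomic, so it is only safe if a single job writes to the blob at a time. It also transfers the whole file on every run, which is one more reason to consider the one-file-per-day layout for large datasets.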


azure-machine-learning-studio

azure-blob-storage