How to design an AWS IoT Analytics pipeline with a separate dataset for each device?
I have a mobile app that collects data from sensors and pushes it to an AWS IoT Core topic. I want to relay this data to AWS IoT Analytics and then analyze it with my own machine-learning code (using a container dataset). It is important that events are isolated and batched by device_id and analyzed within 30-minute time windows. In my case it only makes sense to analyze a group of events together if they were all generated by the same device_id, and the event payload already contains a unique device_id attribute. The first solution that comes to mind is a separate Channel -> Pipeline -> Data Store -> SQL Data Set -> Container Data Set chain for each mobile client. Visually, that is one full, independent chain per device.
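For concreteness, here is a minimal sketch (using boto3, error handling omitted; all resource names are made up for illustration) of what provisioning that per-device chain would involve:

```python
# A minimal sketch of the per-device approach; resource names such as
# "device_42" are hypothetical.
import boto3

iota = boto3.client("iotanalytics")

def create_chain_for_device(device_id: str) -> None:
    """Create a dedicated Channel -> Pipeline -> Data Store chain for one device."""
    channel = f"channel_{device_id}"
    datastore = f"datastore_{device_id}"

    iota.create_channel(channelName=channel)
    iota.create_datastore(datastoreName=datastore)
    # The simplest possible pipeline: read from the channel, write to the data store.
    iota.create_pipeline(
        pipelineName=f"pipeline_{device_id}",
        pipelineActivities=[
            {"channel": {"name": "from_channel", "channelName": channel,
                         "next": "to_datastore"}},
            {"datastore": {"name": "to_datastore", "datastoreName": datastore}},
        ],
    )

# With N devices this loop alone creates 3*N resources, before a single
# dataset exists -- which is exactly the scaling problem described below.
for n in range(50_000):
    create_chain_for_device(f"device_{n}")
```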
Given N devices, the problem with this architecture is that it needs N channels, N effectively identical pipelines, N data stores all holding data of the same type/schema, and finally 2*N datasets. With 50,000 devices the resource count becomes enormous, which made me realize this is not a good solution.
The next idea that came to mind was a single channel, a single pipeline, and a single data store shared by all devices, with only the SQL dataset and the container dataset being per-device. Visually, that is one shared ingestion path fanning out into a pair of datasets per device.
This architecture feels much better, but with 50,000 devices I would still need 100,000 distinct datasets. The default AWS limit is 100 datasets per account. I could of course request a limit increase, but if the default is 100 datasets, does it even make sense to request 1,000 times that? Is AWS IoT Analytics meant to be used with either of these two architectures, or am I missing something?
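For reference, a per-device SQL dataset in this second architecture would look roughly like the sketch below. It is an assumption-laden illustration: shared_datastore and the received_at timestamp attribute are hypothetical names, and the 30-minute window is expressed as a scheduled trigger combined with a delta-time filter so that each run only sees new events:

```python
# Hypothetical per-device SQL dataset; the data store name and the
# received_at attribute are assumptions, not from the original question.
import boto3

iota = boto3.client("iotanalytics")
device_id = "device_42"  # hypothetical

iota.create_dataset(
    datasetName=f"sql_dataset_{device_id}",
    actions=[{
        "actionName": "extract_device_events",
        "queryAction": {
            # Only this device's rows from the shared data store.
            "sqlQuery": f"SELECT * FROM shared_datastore WHERE device_id = '{device_id}'",
            # Delta window: each run reads only events that arrived since the
            # previous run (with a 60-second allowance for late data).
            "filters": [{
                "deltaTime": {
                    "offsetSeconds": -60,
                    "timeExpression": "from_iso8601_timestamp(received_at)",
                }
            }],
        },
    }],
    # Re-materialize the dataset every 30 minutes.
    triggers=[{"schedule": {"expression": "rate(30 minutes)"}}],
)
```

Multiply this (plus a matching container dataset) by 50,000 devices and the dataset-count problem is immediate.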
I posted the same question on the AWS Forum and got a helpful answer from an engineer who works there. I am reposting his answer here for anyone who may have similar architectural requirements:
I don't think a dataset per user is the right way to model this. The way we'd recommend the data architecture would be to use a single dataset (or maybe a small number of datasets pivoted by device type, country or other higher level grouping) and have a SQL query that extracts data for the time period of interest, 30 minutes in your case. Next you trigger a container dataset that consumes the dataset and prepares the final analysis you need per user.
The notebook would basically iterate over every unique customer id (you may have been able to do grouping and ordering in the SQL to make this faster) and perform the analysis you need before sending that data where needed. You could have 1 container dataset to do the initial data processing per customer and a second container dataset to do the ML training depending on the complexity of the scenario, but for many cases a single container dataset will be fine - I've used this approach to train tens of thousands of individual 'devices' so this may also work for your use case.
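To make that suggestion concrete, a hedged sketch of the notebook's inner loop might look like this (pandas-based; the dataset path and the analyze_device placeholder are my own assumptions, not an actual IoT Analytics API):

```python
# A sketch of what the container dataset's notebook might do, assuming the
# upstream SQL dataset is already narrowed to the last 30 minutes and its
# content is available as CSV. analyze_device() stands in for your own ML code.
import pandas as pd

def analyze_device(device_id: str, events: pd.DataFrame) -> None:
    """Placeholder for the per-device analysis / training step."""
    ...

# IoT Analytics hands the dataset content to the container; the exact path
# comes from the container's input parameters, so this one is illustrative.
df = pd.read_csv("/opt/ml/input/dataset.csv")

# One pass over every unique device id, as the engineer suggests; grouping in
# pandas mirrors what a GROUP BY / ORDER BY in the SQL would accomplish.
for device_id, events in df.groupby("device_id"):
    analyze_device(device_id, events)
```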