Represent on Data Studio entries duplicated by joins

Question

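I am working on a project that builds an ETL process and a dashboard to monitor some KPIs. I created a table in BigQuery where, once a month, I save some key values computed by aggregating data from other tables. I am measuring the emails sent by employees, so to compute one of these key values I need to read from two different tables and perform a left join that matches each company work area present in the aggregation (left table) with the number of employees in that area (right table).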

This is a simplification of my tables:

Sent emails, grouped by area

|  Area Id  |  Service  |  Bad employees  | ...
|     1     |   Gmail   |      3416       | ...
|     2     |   Gmail   |     10782       | ...
|     2     |   Groups  |      9267       | ...

Total employees, grouped by area

|  Area Id  |  Total employees  | ...
|     1     |       34124       | ...
|     2     |       82561       | ...
|     3     |       49472       | ...

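Here comes the problem: as you can see, the first table (sent emails) has a field that does not appear in the second one, namely Service. Because of this, when I join the two tables I get duplicated values for the Total employees field: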

Joined table

|  Area Id  |  Service  |  Bad employees  |  Total employees  |
|     1     |   Gmail   |      3416       |       34124       |
|     2     |   Gmail   |     10782       |       82561       |
|     2     |   Groups  |      9267       |       82561       |

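This last table will be used to create a report in Data Studio. I want to keep the Service field in my final table because I want to give users the option to filter by it. I cannot edit the employees table schema and add a Service field to its entries, because that information belongs only to the emails table: it represents the service that sent the email and has nothing to do with the employees table.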

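I am struggling to find a valid data modeling option for this problem. If I use this solution and want to represent in Data Studio, say, the Total number of employees per selected areas, I get wrong values for the areas that contain more than one service: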

Total employees for area 1: 34,124
Total employees for area 2: 82,561 + 82,561 = 165,122
Total employees: 34,124 + 165,122 = 199,246
Expected value: 34,124 + 82,561 = 116,685

This will affect any metric that uses the total employees value.
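The arithmetic above can be reproduced with a small sketch. It is plain Python rather than BigQuery SQL, the variable names (`emails`, `employees`, `joined`) are made up for illustration, and the rows are the ones from the simplified tables in the question.

```python
# Toy copies of the two tables from the question (names are illustrative).
emails = [
    {"area_id": 1, "service": "Gmail", "bad_employees": 3416},
    {"area_id": 2, "service": "Gmail", "bad_employees": 10782},
    {"area_id": 2, "service": "Groups", "bad_employees": 9267},
]
employees = {1: 34124, 2: 82561, 3: 49472}  # area_id -> total employees

# Left join: every email row picks up the total of its area,
# so area 2 carries 82561 on two rows.
joined = [
    dict(row, total_employees=employees.get(row["area_id"]))
    for row in emails
]

# Naive aggregation over the joined rows double-counts area 2.
naive_total = sum(row["total_employees"] for row in joined)

# Counting each area only once gives the expected figure.
per_area = {row["area_id"]: row["total_employees"] for row in joined}
expected_total = sum(per_area.values())

print(naive_total, expected_total)
```

The naive sum matches the wrong figure in the question, and the per-area sum matches the expected one.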

How can I keep the Service field of my joined table and still represent the correct value of Total employees in Data Studio?

Answer 1

This can be solved by distributing Total employees evenly across the services of each area.

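The "sent emails" dataset has to be included in the blended data a second time. Use Area Id as the only join key, and add Record Count as a metric field named areas in emails.
In the chart, add a field with Total employees / areas in emails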
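As a rough illustration of the idea in this answer, the following sketch divides each area's total by the number of joined rows for that area, so that the shares add back up to the real total. It is plain Python, not a Data Studio blend, and `rows_per_area` and `employees_share` are hypothetical names standing in for areas in emails and the calculated field.

```python
from collections import Counter

# Joined rows from the question (Total employees repeated per service).
joined = [
    {"area_id": 1, "service": "Gmail", "total_employees": 34124},
    {"area_id": 2, "service": "Gmail", "total_employees": 82561},
    {"area_id": 2, "service": "Groups", "total_employees": 82561},
]

# Stand-in for the "areas in emails" record count per Area Id.
rows_per_area = Counter(row["area_id"] for row in joined)

# Stand-in for the calculated field: Total employees / areas in emails.
for row in joined:
    row["employees_share"] = row["total_employees"] / rows_per_area[row["area_id"]]

# Summing the shares over all rows recovers the expected total.
total = sum(row["employees_share"] for row in joined)

# With a filter on one service, only that service's share is counted.
gmail_only = sum(
    row["employees_share"] for row in joined if row["service"] == "Gmail"
)

print(total, gmail_only)
```

One consequence of this design: once a service filter is applied, the sum covers only part of an area's employees, so the figure is exact only when all services of an area are selected.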

Answer 2

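I solved this by using nested and repeated fields. I thought Data Studio could not filter by values inside a repeated field, but I have checked that it can, so it fits my use case well.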

Joined table schema:

[
  {
    "mode": "REQUIRED",
    "name": "id",
    "type": "INTEGER"
  },
  { 
    "mode": "REPEATED",
    "name": "service",
    "type": "RECORD",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "name",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "bad_employees",
        "type": "INTEGER"
      }
    ]
  },
  {
    "mode": "NULLABLE",
    "name": "total_employees",
    "type": "INTEGER",
    "description": "Sum of the emails sent during off hours for all the sources"
  }
]

Joined table representation:

|    id     |  service.name  |  service.bad_employees  |  total_employees  |
|     1     |     Gmail      |          3416           |       34124       |
|     2     |     Gmail      |         10782           |       82561       |
|           |     Groups     |          9267           |                   |

This way, I can get the correct sum of bad_employees with SUM(service.bad_employees) and the correct value of total_employees with SUM(total_employees).

Also, if I want to filter by a specific service only, I can add a control on the field service.name and it filters correctly.
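To make the nested layout concrete, here is a minimal sketch that models the joined table as Python dictionaries, with service as a repeated record. The `totals` helper and its filter rule (keep an area if at least one of its services matches) are assumptions made for illustration; they are not a description of how BigQuery or Data Studio evaluate the query internally.

```python
# One entry per area; "service" is the repeated record from the schema.
areas = [
    {
        "id": 1,
        "service": [{"name": "Gmail", "bad_employees": 3416}],
        "total_employees": 34124,
    },
    {
        "id": 2,
        "service": [
            {"name": "Gmail", "bad_employees": 10782},
            {"name": "Groups", "bad_employees": 9267},
        ],
        "total_employees": 82561,
    },
]


def totals(rows, service_name=None):
    """Return (sum of bad_employees, sum of total_employees).

    Hypothetical helper: total_employees is counted once per area,
    no matter how many service records the area holds.
    """
    bad = 0
    employees = 0
    for row in rows:
        services = [
            s for s in row["service"]
            if service_name is None or s["name"] == service_name
        ]
        if not services:
            continue  # area has no matching service, skip it entirely
        bad += sum(s["bad_employees"] for s in services)
        employees += row["total_employees"]
    return bad, employees


print(totals(areas))
print(totals(areas, "Groups"))
```

Because total_employees lives at the area level rather than on each service row, it is never repeated, which is why a plain sum over it stays correct in this model.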

join

data-modeling

left-join

google-data-studio