Apache Beam per-user session windows are unmerged
We have an app with users; each user uses our app for something like 10-40 minutes at a time, and I would like to derive higher-level events per session based on the specific events that occurred (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I would like to count these higher-level events per day, but that is a separate question.)
For this I have been looking into session windows; but all the docs seem geared towards global session windows, whereas I would like to create them per user (which is also a natural partitioning).
I cannot find documentation on how to do this (Python preferred). Could you point me in the right direction?
Or in other words: how do I create per-user, per-session windows that can output more structured (enriched) events?
What I have
import logging
from itertools import groupby

import apache_beam as beam


class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s",
                     x['jsonPayload']['value'], x['timestamp'], window)
        yield element

def sum_by_event_type(user_session_events):
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d

# ... ('valid' is the upstream PCollection of parsed events)

by_user = valid \
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))

session_gap = 5 * 60  # [s]; 5 minutes

user_sessions = by_user \
    | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
    | 'debug_printer' >> beam.ParDo(DebugPrinter()) \
    | beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
As you can see, the Sessions() window does not extend, it only groups together events that lie very close to each other... What am I doing wrong?
You can make this work by adding a Group By Key transform after the windowing. You have assigned keys to the records, but you haven't actually grouped them together by key, so the session windowing (which works on a per-key basis) does not know that these events need to be merged together.
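In terms of the pipeline from the question, the missing piece looks roughly like this (a sketch only; note that everything downstream of the GroupByKey then receives (key, iterable-of-events) pairs, so the DebugPrinter and the combiner have to be adapted accordingly, as in the full example below):

user_sessions = (by_user
                 | 'user_session_window' >> beam.WindowInto(
                       beam.window.Sessions(session_gap),
                       timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW)
                 # The missing step: the GroupByKey is what actually merges the
                 # overlapping per-element session windows for each key
                 | 'group' >> beam.GroupByKey())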
To prove this, I made a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and to be able to test it more quickly). All five events have the same key, or user_id, but they "arrive" 1, 2, 4 and 8 seconds apart from each other. Since I use a session_gap of 5 seconds, I expect the first 4 elements to be merged into the same session. The 5th event comes 8 seconds after the 4th one, so it must be relegated to the next session (gap of more than 5 seconds). The data is created like this:
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]
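As a quick sanity check of those gaps (a standalone snippet reusing the same data definition): consecutive timestamps differ by 2**(e+1) - 2**e = 2**e seconds, i.e. 1, 2, 4 and 8 seconds.

import time

data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]

# Differences between consecutive timestamps: 2**(e+1) - 2**e = 2**e seconds
gaps = [data[i + 1]['timestamp'] - data[i]['timestamp'] for i in range(4)]
print(gaps)  # -> approximately [1.0, 2.0, 4.0, 8.0]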
We initialize the pipeline with beam.Create(data) and assign the "fake" timestamps with beam.window.TimestampedValue. Again, we only use this to simulate streaming behavior. After that, we create the key-value pairs via the user_id field, we window into window.Sessions, and then we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:
from apache_beam.transforms import window

events = (p
          | 'Create Events' >> beam.Create(data)
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
          | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
          | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                                     timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
          | 'Group' >> beam.GroupByKey()
          | 'debug_printer' >> beam.ParDo(DebugPrinter()))
where DebugPrinter is:
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        for x in element[1]:
            logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
        yield element
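After the GroupByKey, each element that reaches the DoFn is a (key, iterable-of-events) pair rather than a single event, which is why this version iterates over element[1]. Illustratively (values made up to match the log output below):

# Shape of one element entering DebugPrinter after the GroupByKey:
element = ('Thanos', [
    {'user_id': 'Thanos', 'value': 'event_0', 'timestamp': 1554118377.37},
    {'user_id': 'Thanos', 'value': 'event_1', 'timestamp': 1554118378.37},
    # ... the remaining events that fell into the same session window
])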
If we test this without grouping by key, we get the same behavior:
INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)
But after adding it, the windows work as expected. Events 0 to 3 are merged together into an extended 12-second session window. Event 4 belongs to a separate 5-second session.
INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)
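The 12-second figure can be checked by hand: with Sessions, each element initially gets its own proto-window [t, t + gap), and overlapping windows of the same key are merged at the GroupByKey, so the merged session runs from the first timestamp to the last timestamp plus the gap. A quick sanity check using the timestamps from the log above:

gap = 5  # the session_gap used in this example, in seconds

# Timestamps of events 0-3, taken from the log output above
ts = [1554118377.37, 1554118378.37, 1554118380.37, 1554118384.37]

# Merged session window: [min(t), max(t) + gap)
start, end = min(ts), max(ts) + gap
print(start, end)   # 1554118377.37 1554118389.37
print(end - start)  # 12.0 -> the extended 12-second session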
Full code here
Two more things are worth mentioning. The first one is that, even when running locally on a single machine with the DirectRunner, records can come unordered (in my case, event_3 was processed before event_2). This is done on purpose to simulate distributed processing, as documented here.
The last one is that, if you get a stack trace like this:
TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']
downgrade from the 2.10.0/2.11.0 SDK to 2.9.0. For example, see this.