Pyspark join with functions and difference between timestamps
I'm trying to join 2 tables of user events. I want to join table_a and table_b on user_id (uid), keeping only matches where the difference between the timestamps is less than 5 s (5000 ms).
Here's what I'm doing:
table_a = (
    table_a
    .join(
        table_b,
        table_a.uid == table_b.uid
        & abs(table_b.b_timestamp - table_a.a_timestamp) < 5000
        & table_a.a_timestamp.isNotNull(),
        how='left'
    )
)
I'm getting 2 errors:
Error 1)
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
If I remove the second join condition and keep only the first and third, I get error 2:
org.apache.spark.sql.AnalysisException: cannot resolve '(`uid` AND (`a_timestamp` IS NOT NULL))' due to data type mismatch: differing types in '(`uid` AND (`a_timestamp` IS NOT NULL))' (string and boolean).;;
Any help is much appreciated!
You just need parentheses around each filter condition. For example, the following works:
from pyspark.sql.functions import abs  # pyspark.sql.functions.abs; the Python builtin abs() does not accept a Column

df1 = spark.createDataFrame([
    (1, 20),
    (1, 21),
    (1, 25),
    (1, 30),
    (2, 21),
], ['id', 'val'])
df2 = spark.createDataFrame([
    (1, 21),
    (2, 30),
], ['id', 'val'])

df1.join(
    df2,
    (df1.id == df2.id)
    & (abs(df1.val - df2.val) < 5)
).show()
# +---+---+---+---+
# | id|val| id|val|
# +---+---+---+---+
# | 1| 20| 1| 21|
# | 1| 21| 1| 21|
# | 1| 25| 1| 21|
# +---+---+---+---+
But without parentheses it fails, because Python's & binds more tightly than the comparison operators == and <. The condition below is parsed as the chained comparison df1.id == (df2.id & abs(df1.val - df2.val)) < 5, and the implicit 'and' between the two resulting Column comparisons is what raises the ValueError:
df1.join(
    df2,
    df1.id == df2.id
    & abs(df1.val - df2.val) < 5
).show()
# ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
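Applied to the join in the question, a minimal sketch with the parentheses added (assuming the column names uid, a_timestamp, and b_timestamp from the question, and that abs is pyspark.sql.functions.abs rather than the builtin):

from pyspark.sql.functions import abs  # shadows the Python builtin, which rejects Columns

table_a = (
    table_a
    .join(
        table_b,
        (table_a.uid == table_b.uid)                               # condition 1: same user
        & (abs(table_b.b_timestamp - table_a.a_timestamp) < 5000)  # condition 2: within 5000 ms
        & (table_a.a_timestamp.isNotNull()),                       # condition 3: a_timestamp present
        how='left'
    )
)

The same precedence rule explains error 2 in the question: without parentheses, table_a.uid == table_b.uid & table_a.a_timestamp.isNotNull() is parsed as table_a.uid == (table_b.uid & table_a.a_timestamp.isNotNull()), which ANDs the string column uid with a boolean, hence the 'differing types (string and boolean)' AnalysisException. The parentheses around the isNotNull() call are not strictly required, since a method call binds tighter than &, but parenthesizing every condition keeps the precedence explicit.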