Error in filtering after join in sparklyr
The code I'm using looks like this (just a simple join):
tbl(sc, 'dez') %>% inner_join(tbl(sc, 'deg'), by = c("timefrom" = "timefromdeg", "elemuid")) %>%
filter(number.x > 2500) %>% glimpse()
The content of the individual data frames doesn't matter; the join itself works. To save compute, I want to filter (among other things) directly after the join.
But now I get an error from Spark saying it cannot resolve the variable number.x.
I don't understand this, because the variable appears in the error message itself:
Error: org.apache.spark.sql.AnalysisException: cannot resolve '`number.x`' given input columns: [elemname.x, kind.y, timefrom, timetodeg, timeto, kind.x, elemuid, elemname.y, number.y, number.x]; line 7 pos 7;
'Project [*]
+- 'Filter ('number.x > 2500.0)
+- SubqueryAlias yoxgbdyqlw
+- Project [elemuid#7505 AS elemuid#7495, elemname#7506 AS elemname.x#7496, kind#7507 AS kind.x#7497, number#7508 AS number.x#7498, timefrom#7509 AS timefrom#7499, timeto#7510 AS timeto#7500, elemname#7512 AS elemname.y#7501, kind#7513 AS kind.y#7502, number#7514 AS number.y#7503, timetodeg#7516 AS timetodeg#7504]
+- Join Inner, ((timefrom#7509 = timefromdeg#7515) && (elemuid#7505 = elemuid#7511))
:- SubqueryAlias TBL_LEFT
: +- SubqueryAlias dez
: +- HiveTableRelation `default`.`dez`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [elemuid#7505, elemname#7506, kind#7507, number#7508, timefrom#7509, timeto#7510]
+- SubqueryAlias TBL_RIGHT
+- SubqueryAlias deg
+- HiveTableRelation `default`.`deg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [elemuid#7511, elemname#7512, kind#7513, number#7514, timefromdeg#7515, timetodeg#7516]
A collect() after the join is not an option, because I then run out of memory. Is there any way to make this work?
I'd be grateful for any help!
TL;DR Don't use the default suffix (c(".x", ".y")):
set.seed(1)
df1 <- copy_to(sc, tibble(id = 1:3, value = rnorm(3)))
df2 <- copy_to(sc, tibble(id = 1:3, value = rnorm(3)))
df1 %>%
  inner_join(df2, by = "id", suffix = c("_x", "_y")) %>%
  filter(value_y > -0.836)
# # Source: lazy query [?? x 3]
# # Database: spark_connection
# id value_x value_y
# <dbl> <dbl> <dbl>
# 1 1. -0.626 1.60
# 2 2. 0.184 0.330
# 3 3. -0.836 -0.820
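Applied to the question's own tables, the same fix looks like this (a sketch; it assumes an active Spark connection `sc` with the `dez` and `deg` tables registered, as in the question):

```r
library(sparklyr)
library(dplyr)

# Passing suffix = c("_x", "_y") makes the join produce number_x / number_y
# instead of number.x / number.y, so the filter can be resolved by Spark.
tbl(sc, "dez") %>%
  inner_join(tbl(sc, "deg"),
             by = c("timefrom" = "timefromdeg", "elemuid"),
             suffix = c("_x", "_y")) %>%
  filter(number_x > 2500) %>%
  glimpse()
```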
The problem:
Spark allows deeply nested structures and complex types. struct fields are accessed with dot syntax, using the full path to the field. That's why a name like number.x
is ambiguous.
Normally we'd escape such names in queries with backticks
`number.x`
but as far as I know there is no way to express that through the dplyr
API (perhaps some rlang
trick could, but I can't think of one right now).
This problem isn't specific to joins; you should avoid using .
in column names in general. If you end up with such names for some reason, you can always fall back to the native Spark API and fix them:
df3 <- copy_to(sc, tibble(value.x = rnorm(42)))
df3 %>%
  spark_dataframe() %>%
  invoke("withColumnRenamed", "`value.x`", "value_x") %>%
  sdf_register()
# # Source: table<sparklyr_tmp_61acdbbc592> [?? x 1]
# # Database: spark_connection
# value_x
# <dbl>
# 1 -0.0162
# 2 0.944
# 3 0.821
# 4 0.594
# 5 0.919
# 6 0.782
# 7 0.0746
# 8 -1.99
# 9 0.620
# 10 -0.0561
# # ... with more rows
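If a join has already produced several dotted names, the same native-API escape hatch can be applied to all of them at once. A sketch, assuming an active connection `sc` and the question's `dez`/`deg` tables; the loop renames every column containing a dot by replacing `.` with `_`:

```r
library(sparklyr)
library(dplyr)

joined <- tbl(sc, "dez") %>%
  inner_join(tbl(sc, "deg"), by = c("timefrom" = "timefromdeg", "elemuid"))

# Drop down to the underlying Spark DataFrame and rename column by column.
# Backticks are needed on the *old* name so Spark treats the dot literally
# instead of as struct-field access.
sdf <- spark_dataframe(joined)
for (col in invoke(sdf, "columns")) {
  if (grepl(".", col, fixed = TRUE)) {
    sdf <- invoke(sdf, "withColumnRenamed",
                  paste0("`", col, "`"),
                  gsub(".", "_", col, fixed = TRUE))
  }
}

# Register the cleaned result so dplyr verbs work again.
renamed <- sdf_register(sdf)
renamed %>% filter(number_x > 2500)
```

This keeps everything lazy on the Spark side, so it avoids the out-of-memory problem that a collect() would cause.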