Extract elements from Spark array column using SparklyR "select"
I have a Spark data frame in the SparklyR interface, and I am trying to extract elements from an array column.
df <- copy_to(sc, data.frame(A=c(1,2),B=c(3,4))) ## BUILD DATAFRAME
dfnew <- df %>% mutate(C=Array(A,B)) %>% select(C) ## CREATE ARRAY COL
> dfnew ## VIEW DATAFRAME
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
dfnew %>% sdf_schema() ## VERIFY COLUMN TYPE IS ARRAY
$C$name
[1] "C"
$C$type
[1] "ArrayType(DoubleType,true)"
I can extract an element with "mutate"...
dfnew %>% mutate(myfirst_element=C[[1]])
# Source: spark<?> [?? x 2]
C myfirst_element
<list> <dbl>
1 <dbl [2]> 3
2 <dbl [2]> 4
But I would like to extract an element on the fly with "select". However, every attempt just returns the full column:
> dfnew %>% select("C"[1])
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]])
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][1])
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][[1]])
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
I also tried "sdf_select", without success:
> dfnew %>% sdf_select("C"[[1]][1])
# Source: spark<?> [?? x 1]
C
<list>
1 <dbl [2]>
2 <dbl [2]>
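In PySpark you can access an element explicitly, e.g. col("C")[1]; in Scala you can use getItem or element_at; in SparkR you can also use element_at. But does anyone know of a solution in the SparklyR setting? Thanks in advance for your help.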
The following solution came to mind.
library(tidyverse)
df = tibble(group = 1:5) %>%
mutate(C = map(group, ~array(c(1,2),c(3,4))))
df
# # A tibble: 5 x 2
# group C
# <int> <list>
# 1 1 <dbl [3 x 4]>
# 2 2 <dbl [3 x 4]>
# 3 3 <dbl [3 x 4]>
# 4 4 <dbl [3 x 4]>
# 5 5 <dbl [3 x 4]>
df$C
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
# [3,] 1 2 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
# [3,] 1 2 1 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
# [3,] 1 2 1 2
#
# [[4]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
# [3,] 1 2 1 2
#
# [[5]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
# [3,] 1 2 1 2
df %>% pull(C) %>% map(~.x[1,])
# [[1]]
# [1] 1 2 1 2
#
# [[2]]
# [1] 1 2 1 2
#
# [[3]]
# [1] 1 2 1 2
#
# [[4]]
# [1] 1 2 1 2
#
# [[5]]
# [1] 1 2 1 2
df %>% pull(C) %>% map(~.x[,2])
# [[1]]
# [1] 2 1 2
#
# [[2]]
# [1] 2 1 2
#
# [[3]]
# [1] 2 1 2
#
# [[4]]
# [1] 2 1 2
#
# [[5]]
# [1] 2 1 2
df %>% pull(C) %>% map(~.x[1:2,])
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
#
# [[4]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
#
# [[5]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 2
# [2,] 2 1 2 1
I think this is what you are looking for.
Of course, this also works for arrays of any size.
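A side note on why the "select" attempts in the question all return the full column: expressions such as "C"[1] are ordinary R indexing on a length-one character vector, so R reduces each of them to the string "C" and select() only ever receives the column name. The sketch below checks this in plain R. It also defines extract_element, a hypothetical helper that is not from the thread, which reuses the question's working mutate expression inside transmute so that only the extracted value is kept. It assumes transmute is translated for Spark tables the same way mutate is, and it has not been run against a cluster here.

```r
# Each selector from the question is plain R string indexing,
# so it is evaluated to "C" before select() ever sees it.
selectors <- list(
  "C"[1],
  "C"[[1]],
  "C"[[1]][1],
  "C"[[1]][[1]]
)
all_plain_c <- all(vapply(selectors, identical, logical(1), "C"))
print(all_plain_c)  # TRUE: every attempt is just select("C")

# Hypothetical helper (an assumption, not part of the original thread):
# keep only the extracted element by reusing the mutate() expression
# from the question inside transmute().
extract_element <- function(tbl) {
  dplyr::transmute(tbl, myfirst_element = C[[1]])
}
```

With a live connection, dfnew %>% extract_element() should then yield a single myfirst_element column, provided the assumption about transmute holds for your sparklyr and dbplyr versions.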