从 JDBC 连接读取时如何使用谓词?
How to use a predicate while reading from JDBC connection?
默认情况下,spark_read_jdbc()
将整个数据库 table 读入 Spark。我使用以下语法来创建这些连接。
library(sparklyr)
library(dplyr)
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"
sc <- spark_connect(master = "local",
version = "1.6.0",
hadoop_version = 2.4,
config = config)
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "table_name"))
但是,我现在遇到了这样的情况,我在 MySQL 数据库中有一个 table,我宁愿只将这个 table 的一个子集读入 Spark .
如何让 spark_read_jdbc
接受谓词?我试过将谓词添加到选项列表但没有成功,
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "table_name",
predicates = "field > 1"))
您可以将 dbtable
替换为查询:
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "(SELECT * FROM table_name WHERE field > 1) as my_query"))
但是像这样的简单条件,Spark 应该会在您过滤时自动推送它:
db_tbl %>% filter(field > 1)
只需确保设置:
memory = FALSE
在 spark_read_jdbc
.
默认情况下,spark_read_jdbc()
将整个数据库 table 读入 Spark。我使用以下语法来创建这些连接。
library(sparklyr)
library(dplyr)
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"
sc <- spark_connect(master = "local",
version = "1.6.0",
hadoop_version = 2.4,
config = config)
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "table_name"))
但是,我现在遇到了这样的情况,我在 MySQL 数据库中有一个 table,我宁愿只将这个 table 的一个子集读入 Spark .
如何让 spark_read_jdbc
接受谓词?我试过将谓词添加到选项列表但没有成功,
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "table_name",
predicates = "field > 1"))
您可以将 dbtable
替换为查询:
db_tbl <- sc %>%
spark_read_jdbc(sc = .,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "(SELECT * FROM table_name WHERE field > 1) as my_query"))
但是像这样的简单条件,Spark 应该会在您过滤时自动推送它:
db_tbl %>% filter(field > 1)
只需确保设置:
memory = FALSE
在 spark_read_jdbc
.