Pyspark:使用参数动态准备 pyspark-sql 查询

Pyspark : Dynamically prepare pyspark-sql query using parameters

动态绑定参数和准备 pyspark-sql statament 的不同方法有哪些。

示例:

动态查询

query = '''SELECT column1, column2
           FROM ${db_name}.${table_name}
           WHERE column1 = ${filter_value}'''

上面的动态查询有${db_name}、${table_name}和${filter_value}变量,这些变量将从运行时间参数中获取值.

参数详情:

db_name = 'your_db_name'
table_name = 'your_table_name'
filter_value = 'some_value'

动态查询中绑定参数后的预期查询

SELECT column1, column2
FROM your_db_name.your_table_name
WHERE column1 = some_value  

这里有几个选项可以通过绑定参数准备 pyspark-sql。

选项#1 - 使用字符串插值/f-Strings (Python 3.6+)

db_name = 'your_db_name'
table_name = 'your_table_name'
filter_value = 'some_value'

query = f'''SELECT column1, column2
           FROM {db_name}.{table_name}
           WHERE column1 = {filter_value}'''

选项#2 - 使用字符串格式 (str.format)

query = '''SELECT column1, column2
           FROM {}.{}
           WHERE column1 = {}'''

db_name = 'your_db_name'
table_name = 'your_table_name'
filter_value = 'some_value'

query.format(db_name, table_name, filter_value)

选项#3 - 使用模板字符串

query = '''SELECT column1, column2
           FROM ${db_name}.${table_name}
           WHERE column1 = ${filter_value}'''

db_name = 'your_db_name'
table_name = 'your_table_name'
filter_value = 'some_value'

from string import Template
t = Template(query)
t.substitute(db_name=db_name, table_name=table_name, filter_value=filter_value)      
  • String Interpolation/f-Strings (Option#1) 如果你有 python 3.6+ 否则使用字符串格式 str.format (Option#2)

  • 模板字符串对于处理用户提供的字符串更有用 (选项#3)