What does the FOR operator do in Spark SQL?

I am using Spark SQL in the Scala REPL and got the following error:

extraneous input ')' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 'DATABASES', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', 'NOT', 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', 'RLIKE', 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', 'TEMPORARY', 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 36)

I realized that I know nothing about some of these operators. I fiddled with one of them and eventually got it to output something:

spark.sql("select add from df FOR (select ,add ,none) where select = 1 ").show

What does this FOR operator do? And what are the operators PARTITIONS, FOLLOWING, DISTRIBUTE and GROUPING used for?

  • FOR is a keyword used in some string functions to specify the length, and in the PIVOT clause for the pivot values:
SELECT substring('Spark SQL' FROM 5 FOR 1)
-- returns 'k'

SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0)
-- returns 'Spark ANSI SQL'

-- pivot
SELECT * FROM t
PIVOT (SUM(a) AS a FOR b IN ('x', 'y'))
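
For a self-contained version of the pivot example above (a sketch with my own sample data; the table t is assumed to have columns a and b):

SELECT * FROM VALUES (1, 'x'), (2, 'x'), (3, 'y') AS t(a, b)
PIVOT (SUM(a) FOR b IN ('x', 'y'))
-- pivots into a single row: 3 under 'x' and 3 under 'y'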

It is also used for aliasing table columns when used after from table as FOR(...), which is equivalent to as tab(...):

SELECT max(x) FROM VALUES (10), (50), (20) FOR(x) -- aliasing table columns
-- above is same as 
SELECT max(x) FROM VALUES (10), (50), (20) AS tab(x)
  • FOLLOWING is a keyword used to specify window frame boundaries when using window functions:
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING
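
For instance, a complete query using that frame (a sketch with my own sample data) might be:

SELECT a,
       sum(a) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) AS s
FROM VALUES (10), (50), (20) AS tab(a)
-- for each row, s sums all preceding rows, the current row and the next row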
  • DISTRIBUTE BY is a clause used to repartition the data based on the input expressions:
SELECT a, b FROM t DISTRIBUTE BY b
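
To make that runnable as-is (a sketch with my own sample data; DISTRIBUTE BY does not change the rows, only how they are partitioned across the cluster):

SELECT a, b FROM VALUES (10, 'x'), (50, 'y'), (20, 'x') AS tab(a, b)
DISTRIBUTE BY b
-- rows with the same value of b are hash-partitioned into the same partition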
  • GROUPING is a function that indicates whether a specified column in a GROUP BY is aggregated or not. There is also GROUPING SETS, which is used with the GROUP BY clause, as sketched below.
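
A minimal sketch of GROUPING together with GROUPING SETS (my own sample data, Spark 3.x syntax):

SELECT b, grouping(b), sum(a)
FROM VALUES (1, 'x'), (2, 'x'), (3, 'y') AS t(a, b)
GROUP BY GROUPING SETS ((b), ())
-- grouping(b) is 0 on the per-b rows and 1 on the grand-total row (where b is NULL)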

You can refer to the official documentation: Spark SQL, Built-in Functions and SQL Reference. And you can find many examples simply by searching for the keywords you want in the Apache Spark GitHub repo.