Spark-submit: undefined function parse_url
The parse_url function works when we use spark-sql through sql-client (via the Thrift server), IPython, or the pyspark shell, but it does not work when run through spark-submit:
/opt/spark/bin/spark-submit --driver-memory 4G --executor-memory 8G main.py
The error is:
Traceback (most recent call last):
File "/home/spark/***/main.py", line 167, in <module>
)v on registrations.ga = v.ga and reg_path = oldtrack_page and registration_day = day_cl_log and date_cl_log <= registration_date""")
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 40, in deco
pyspark.sql.utils.AnalysisException: undefined function parse_url;
Build step 'Execute shell' marked build as failure
Finished: FAILURE
So we are using the following workaround:
def python_parse_url(url, que, key):
    import urlparse
    ians = None
    if que == "QUERY":
        ians = urlparse.parse_qs(urlparse.urlparse(url).query)[key][0]
    elif que == "HOST":
        ians = urlparse.urlparse(url).hostname
    elif que == "PATH":
        ians = urlparse.urlparse(url).path
    return ians

def dc_python_parse_url(url, que, key):
    ians = None
    try:
        ians = python_parse_url(url, que, key)
    except:
        pass
    return ians

sqlCtx.registerFunction('my_parse_url', dc_python_parse_url)
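As a side note, the workaround above imports the Python 2 `urlparse` module. On Python 3 the same logic can be written with `urllib.parse`; here is a minimal, hedged sketch (function and argument names follow the question's code, not any Spark API):

```python
from urllib.parse import urlparse, parse_qs

def python_parse_url(url, que, key=None):
    """Mimic Hive's parse_url for the HOST, PATH and QUERY parts (Python 3)."""
    try:
        parsed = urlparse(url)
        if que == "QUERY":
            # parse_qs returns lists of values; take the first one, as Hive does
            return parse_qs(parsed.query)[key][0]
        elif que == "HOST":
            return parsed.hostname
        elif que == "PATH":
            return parsed.path
    except (KeyError, ValueError, TypeError):
        # Malformed URL or missing query key: return NULL-like None
        return None
    return None

print(python_parse_url("http://example.com/foo/bar?foo=bar", "HOST"))          # example.com
print(python_parse_url("http://example.com/foo/bar?foo=bar", "PATH"))          # /foo/bar
print(python_parse_url("http://example.com/foo/bar?foo=bar", "QUERY", "foo"))  # bar
```

The function could then be registered as a UDF the same way as in the snippet above.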
Any help with this issue would be appreciated.
Spark >= 2.0
Same as below, but use a SparkSession with Hive support enabled:

SparkSession.builder.enableHiveSupport().getOrCreate()
Spark < 2.0
parse_url is not a standard SQL function. It is a Hive UDF, so it requires a HiveContext to work:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
query = """SELECT parse_url('http://example.com/foo/bar?foo=bar', 'HOST')"""
sqlContext.sql(query)
## Py4JJavaError Traceback (most recent call last)
## ...
## AnalysisException: 'undefined function parse_url;'
hiveContext.sql(query)
## DataFrame[_c0: string]