TypeError: Invalid argument, not a string or column: [79, -1, -1] of type <class 'list'> column literals use 'lit' 'array' 'struct' or 'create_map'
I am running into a problem with a PySpark UDF; it throws the following error:
PythonException: An exception was thrown from a UDF: 'TypeError: Invalid argument, not a string or column: [79, -1, -1] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I am trying to pick a number out of each row based on a precedence array, where priority decreases from left to right.
precedence = [1,2,11,12,13,20,20131,200,202,203,210,220,223,226,235,236,237,242,244,245,247,253,254,257,259,260,262,278,283,701,20107,20108,20109,20112,20115,20123,20135,20141,20144,20152,20162,20163,20167,20168,20169,20170,20171,20172,20173,20174,20175,14,211,213,258,270,273,274,275,277,280,281,287,288,20120,20122,20124,20125,20126,20130,20133,20136,20137,20138,20140,20142,20143,20154,20155,20156,20157]
reverse_order = precedence[::-1]
def get_p(row):
    if (row != None) and (row != "null"):
        temp = row.split(",")
        test = []
        for i in temp:
            if (i.find('=') != -1):
                i = i.split('=')[0]
            if int(i) in reverse_order:
                test.append(reverse_order.index(int(i)))
            else:
                test.append(-1)
        if max(test) != -1:
            return reverse_order[max(test)]
        return -999
    else:
        return None
get_array = udf(get_p, IntegerType())
bronze_table = bronze_table.withColumn("precedence", get_array("event_list"))
bronze_table.select("event_list","precedence").show(100, False)
Here are some sample records:
+---------------------------------------------------------------------------------------+
|event_list |
+---------------------------------------------------------------------------------------+
|276,100,101,202,176 |
|276,100,2,124,176 |
|246,100,101,257,115,116,121,123,124,125,135,138,145,146,153,167,168,170,171,173,189,191|
|246,100,101,278,123,124,135,170,189,191 |
|20131=16,20151,100,101,102,115,116,121,123,124,125,135,138,145,146,153,168,170,171 |
|null |
|20107=9,20151,100,101,102,123,124,135,170,189,191 |
|20108=3,20151,100,101,102,123,124,125,135,170,171,189,191 |
|null |
+---------------------------------------------------------------------------------------+
What I expect:
+---------------------------------------------------------------------------------------+----------+
|event_list |precedence|
+---------------------------------------------------------------------------------------+----------+
|276,100,101,202,176 |202 |
|276,100,2,124,176 |2 |
|246,100,101,257,115,116,121,123,124,125,135,138,145,146,153,167,168,170,171,173,189,191|257 |
|246,100,101,278,123,124,135,170,189,191 |278 |
|20131=16,20151,100,101,102,115,116,121,123,124,125,135,138,145,146,153,168,170,171 |20131 |
|null |null |
|20107=9,20151,100,101,102,123,124,135,170,189,191 |20107 |
|20108=3,20151,100,101,102,123,124,125,135,170,171,189,191 |20108 |
|null |null |
+---------------------------------------------------------------------------------------+----------+
My UDF works as expected in plain Python, but not in PySpark. I would appreciate help resolving this.
A null in a PySpark DataFrame is None in Python, so the condition if row != "null": is not correct on its own; try if row != None: instead.
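To see why the string comparison alone cannot catch a real SQL NULL, here is a minimal pure-Python sketch (no Spark required):

```python
row = None  # what a PySpark UDF receives for a SQL NULL value

# Comparing against the string "null" does not catch None:
print(row != "null")    # True, so the code would proceed to row.split(",") and crash

# The None check is what actually guards against SQL NULLs:
print(row is not None)  # False, so the UDF can safely return None
```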
However, your get_p function does not run correctly for me either, for example:
get_p('276,100,101,202,176')
# output: 2
# expected: 202
get_p('20131=16,20151,100,101,102,115,116,121,123,124,125,135,138,145,146,153,168,170,171')
# output: Exception `invalid literal for int() with base 10: ''`
# expected: 20131
Thanks everyone, the problem is solved. I was able to fix the issue with the max function using this link, as described there.
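For readers hitting the same TypeError: the usual cause (and presumably what the link describes) is a wildcard import such as from pyspark.sql.functions import *, which shadows Python's built-in max with pyspark.sql.functions.max; the latter returns a Column, which produces exactly the "not a string or column: [...] of type <class 'list'>" error when called on a list inside a UDF. A pure-Python sketch of the same shadowing effect:

```python
import builtins

# Stand-in for what `from pyspark.sql.functions import *` does:
# the name `max` no longer refers to Python's built-in function.
def max(col):
    return "Column<max({})>".format(col)

values = [79, -1, -1]
print(max(values))           # the shadowed function runs, not a number comparison
print(builtins.max(values))  # 79 -- the built-in, reached explicitly
```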
Here is the updated code.
import builtins as p  # ensure we call Python's built-in max, not pyspark.sql.functions.max

def get_p(row):
    if (row != None) and (row != "null"):
        temp = row.split(",")
        test = []
        for i in temp:
            if (i.find('=') != -1):
                i = i.split('=')[0]
            if int(i) in reverse_order:
                test.append(reverse_order.index(int(i)))
            else:
                test.append(-1)
        if p.max(test) != -1:
            return reverse_order[p.max(test)]
        return None
    else:
        return None
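As a sanity check, the updated function can be exercised in plain Python against a few of the sample rows, using the precedence list from the question (the expected values come from the table above):

```python
import builtins as p  # avoid any shadowed max from pyspark.sql.functions

precedence = [1,2,11,12,13,20,20131,200,202,203,210,220,223,226,235,236,237,
              242,244,245,247,253,254,257,259,260,262,278,283,701,20107,20108,
              20109,20112,20115,20123,20135,20141,20144,20152,20162,20163,20167,
              20168,20169,20170,20171,20172,20173,20174,20175,14,211,213,258,
              270,273,274,275,277,280,281,287,288,20120,20122,20124,20125,
              20126,20130,20133,20136,20137,20138,20140,20142,20143,20154,
              20155,20156,20157]
reverse_order = precedence[::-1]

def get_p(row):
    if (row != None) and (row != "null"):
        test = []
        for i in row.split(","):
            if i.find('=') != -1:
                i = i.split('=')[0]  # strip the "=value" suffix, e.g. "20131=16" -> "20131"
            test.append(reverse_order.index(int(i)) if int(i) in reverse_order else -1)
        if p.max(test) != -1:
            return reverse_order[p.max(test)]
        return None
    else:
        return None

print(get_p('276,100,101,202,176'))  # 202
print(get_p('276,100,2,124,176'))    # 2
print(get_p('20131=16,20151,100,101,102,123,124,135,170'))  # 20131
print(get_p(None))                   # None
```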