Pyspark 2.7 Set StringType columns in a dataframe to 'null' when value is ""
I have a DataFrame called good_df with columns of mixed types. I am trying to replace any empty value in the StringType columns with 'null'. I thought the code below would work, but it doesn't.
self.good_df = self.good_df.select([when((col(c)=='') & (isinstance(self.good_df.schema[c].dataType, StringType)),'null').otherwise(col(c)).alias(c) for c in self.good_df.columns])
I've looked at the error message, but it doesn't give me much of a clue:
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.7/site-packages/pyspark/sql/column.py", line 116, in _
    njc = getattr(self._jc, name)(jc)
  File "/usr/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in call
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o792.and. Trace:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Does anyone know what is going on?
Thanks!
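The error message you received:
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
means that you are trying to apply the AND operator between a Column expression and a literal Boolean value. The isinstance(...) call returns a plain Python bool, not a Column, so the & operator has no matching method on the JVM side.
You need to change this part: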
(isinstance(self.good_df.schema[c].dataType, StringType))
to
from pyspark.sql.functions import lit
lit(isinstance(self.good_df.schema[c].dataType, StringType))
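That said, you can actually move the column-type check directly into the Python list comprehension, so that only string columns are wrapped in when(...) and every other column is passed through unchanged: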
self.good_df = self.good_df.select(*[
when((col(c) == ''), 'null').otherwise(col(c)).alias(c) if t == "string" else col(c)
for c, t in self.good_df.dtypes
])