TypeError: Column is not iterable

s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)
+--------------------+
|             request|
+--------------------+
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
+--------------------+

I want to get:

+--------------------+---------------+
|             request|            sub|
+--------------------+---------------+
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}|    {'num':123}|
+--------------------+---------------+

I wrote the following, but it throws an error:

TypeError: Column is not iterable

df = df.withColumn("sub",substring(col('request'),locate('{',col('request')),length(col('request'))-locate('{',col('request'))))
df.show()

Can anyone help me?

You need to use the `substring` function inside a SQL expression (via `expr`) in order to pass columns for the `position` and `length` arguments; the Python `substring` function only accepts integers for those parameters, which is why passing a Column raises `TypeError: Column is not iterable`. Also note that you need to add `+1` to the length to get the correct result:

import pyspark.sql.functions as F

df = df.withColumn(
    "json",
    F.expr("substring(request, locate('{',request), length(request) - locate('{', request) + 1)")
)

df.show()
#+--------------------+---------------+
#|             request|           json|
#+--------------------+---------------+
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}|    {'num':123}|
#+--------------------+---------------+
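To see why the `+ 1` is needed, here is a small plain-Python sketch (the function name `extract_sub` is mine, for illustration only) that mimics the 1-based semantics of Spark SQL's `locate` and `substring` on an ordinary string:

```python
def extract_sub(request: str) -> str:
    # locate('{', request) in Spark SQL returns a 1-based position;
    # emulate it with Python's 0-based find() plus 1
    pos = request.find("{") + 1
    # characters remaining from pos through the end of the string:
    # length(request) - pos + 1 (dropping the +1 loses the final '}')
    n = len(request) - pos + 1
    # substring(request, pos, n): 1-based start, take n characters
    return request[pos - 1 : pos - 1 + n]

print(extract_sub("abcd:{'name':'john'}"))  # {'name':'john'}
print(extract_sub("defasdf:{'num':123}"))   # {'num':123}
```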

You could also consider using the `regexp_extract` function instead of `substring`, like this:

df = df.withColumn(
    "json",
    F.regexp_extract("request", r"^.*:(\{.*\})$", 1)
)
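Spark's `regexp_extract` uses Java regex, but for this simple pattern Python's `re` module behaves the same, so you can sanity-check the pattern locally before running it on the cluster. Note that the greedy `^.*:` backtracks past the colon inside the JSON-like payload until the captured group can start at `{`:

```python
import re

# Same pattern passed to regexp_extract; group 1 captures the {...} payload
pattern = re.compile(r"^.*:(\{.*\})$")

for s in ["abcd:{'name':'john'}", "defasdf:{'num':123}"]:
    m = pattern.match(s)
    print(m.group(1))
# {'name':'john'}
# {'num':123}
```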