Error when trying to specify schema for loading a CSV using pyspark

Question

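I am trying to build a schema to load the contents of a file that has the following fields:

movie_id, string
movie_name, string
plot, string
genre, array of strings

Here is a sample -

Here is my schema definition -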

customSchema = types.StructType(types.ArrayType(
                         types.StructField("movie_id", types.StringType(), True),
                         types.StructField("movie_name", types.StringType(), True),
                         types.StructField("plot", types.StringType(), True),
                         types.StructField('genre', 
                         types.ArrayType(types.StringType()), True),
))

Here is the error I am getting:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-290-10452c69a6ff> in <module>()
        3   types.StructField("movie_name", types.StringType(), True),
        4   types.StructField("plot", types.StringType(), True),
  ----> 5   types.StructField('genre', types.ArrayType(types.StringType()), True),
        6 ))

TypeError: __init__() takes from 2 to 3 positional arguments but 5 were given

Answer 1

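StructType() takes a single list of StructField()s, not an ArrayType(). ArrayType() accepts only an element type and an optional containsNull flag, so passing it four StructField()s raises the TypeError shown in the question.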

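from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# StructType takes one list of StructField objects
customSchema = StructType([
    StructField("movie_id", IntegerType()),  # use StringType() if the IDs are not purely numeric
    StructField("movie_name", StringType()),
    StructField("plot", StringType()),
    StructField("genre", ArrayType(StringType())),
])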

If the genre column is read as a string that looks like "['Indie', 'Drama', 'Action']", try converting it to an array of strings:

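from pyspark.sql import functions as F
# Remove the brackets and quotes, then split on commas and any spaces that follow them
# (raw strings keep the backslashes intact; spaces inside a genre name are preserved)
df.withColumn("genre", F.split(F.regexp_replace("genre", r"[\[\]']", ""), r",\s*")).show(truncate=False)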

#+----------------------+
#|genre                 |
#+----------------------+
#|[Indie, Drama, Action]|
#+----------------------+
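The cleaning pattern can be checked without starting Spark. The sketch below reproduces the same two steps with Python's `re` module, on the assumption that Java and Python regex behave the same for patterns this simple. It compares a pattern that also deletes spaces, `\[|]| |'`, with one that leaves them alone. The function names and the sample string are made up for illustration.

```python
import re

def parse_strip_spaces(value):
    """Remove brackets, quotes and ALL spaces, then split on commas."""
    return re.sub(r"\[|]| |'", "", value).split(",")

def parse_keep_spaces(value):
    """Remove brackets and quotes only, then split on a comma plus any spaces."""
    return re.split(r",\s*", re.sub(r"[\[\]']", "", value))

sample = "['Indie', 'Science Fiction']"  # hypothetical value with a multi-word genre
print(parse_strip_spaces(sample))  # ['Indie', 'ScienceFiction']
print(parse_keep_spaces(sample))   # ['Indie', 'Science Fiction']
```

Both variants give the same result for single-word genres; they differ only when a genre name contains a space.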
