Column renaming in a PySpark DataFrame

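I have column names that contain special characters. I renamed the columns and tried to save the DataFrame, but the save failed with a message saying the columns contain special characters. When I run printSchema on the DataFrame, I see column names without any special characters. Here is the code I tried.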

for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace( "(" , ""))
    df_source = df_source.withColumnRenamed(c, c.replace( ")" , ""))
    df_source = df_source.withColumnRenamed(c, c.replace( "." , ""))

df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(stg_location)

I get the following error:

Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

Another thing I noticed is that both df_source.show() and display(df_source) raise the same error, while printSchema shows no special characters.

Can anyone help me find a solution?

Try it as follows:

Input_df

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit

data = [("xyz", 1)]

schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])

df = spark.createDataFrame(data=data, schema=schema)

df.show()

+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
|                           xyz|  1|
+------------------------------+---+

Method 1: Use a regular expression to strip the special characters, then use toDF()

import re

# Raw string avoids invalid-escape warnings; the class removes ".", "(" and ")"
cols = [re.sub(r"[.()]", "", c) for c in df.columns]
df.toDF(*cols).show()

+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
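
The regular expression above removes only ".", "(" and ")", while the error message in the question names a longer set: " ,;{}()\n\t=". Here is a minimal sketch that covers the whole set, assuming you simply want those characters dropped (or replaced). The names `sanitize` and `find_invalid` are hypothetical helpers, and the character set is copied from that error message rather than from the Spark source.

```python
import re

# Characters named in the AnalysisException from the question, plus "."
# which the question also strips. Treat this set as an assumption.
_INVALID = re.compile(r"[ ,;{}()\n\t=.]")

def sanitize(name, replacement=""):
    """Return `name` with every character matched by _INVALID replaced."""
    return _INVALID.sub(replacement, name)

def find_invalid(columns):
    """Return the column names that still contain a rejected character."""
    return [c for c in columns if _INVALID.search(c)]

names = ["Number_of_data_samples_(input)", "id"]
cleaned = [sanitize(n) for n in names]
print(cleaned)                # ['Number_of_data_samples_input', 'id']
print(find_invalid(cleaned))  # []

# With a DataFrame: df = df.toDF(*[sanitize(c) for c in df.columns])
```

Two different source names can collapse to the same cleaned name, so it is worth checking the cleaned list for duplicates before calling toDF().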

Method 2: Use .withColumnRenamed()

for i,j in zip(df.columns,cols):
    df=df.withColumnRenamed(i,j)

df.show()

+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
|                         xyz|  1|
+----------------------------+---+
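
One thing worth noting about the loop in the question: it passes the original name c to all three withColumnRenamed calls. withColumnRenamed is a no-op when the schema does not contain the given column name, so once the first call has changed the name, the later calls with the old name do nothing, and a name containing more than one kind of special character is only partly cleaned. A minimal sketch of one way around that is to compute the final name first and rename once. `strip_chars` and `rename_columns` are hypothetical helper names.

```python
def strip_chars(name, chars="()."):
    """Remove every character in `chars` from `name`."""
    for ch in chars:
        name = name.replace(ch, "")
    return name

def rename_columns(df, chars="()."):
    """Rename each column once, using its fully cleaned name."""
    for old in df.columns:
        new = strip_chars(old, chars)
        if new != old:
            # `old` still exists at this point, so the rename takes effect
            df = df.withColumnRenamed(old, new)
    return df

print(strip_chars("Number_of_data_samples_(input)"))
# Number_of_data_samples_input

# With a DataFrame: df_source = rename_columns(df_source)
```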

Method 3: Use .withColumn to create a new column and drop the existing one (starting again from the input DataFrame)

# col() already returns a Column, so wrapping it in lit() is not needed
df = df.withColumn("Number_of_data_samples_input", col("Number_of_data_samples_(input)")).drop("Number_of_data_samples_(input)")

df.show()

+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
|  1|                         xyz|
+---+----------------------------+
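
Since the error message itself suggests using an alias, a further option is to alias every column in a single selectExpr call, which also keeps the original column order (withColumn appends the new column at the end, as the output above shows). This is only a sketch: `alias_exprs` is a hypothetical helper, and it assumes the column names contain no backticks.

```python
def alias_exprs(columns, chars="()."):
    """Build one "`old` AS `new`" expression per column for selectExpr."""
    exprs = []
    for old in columns:
        new = old
        for ch in chars:
            new = new.replace(ch, "")
        # Backticks let the old name contain characters such as "(" or "."
        exprs.append(f"`{old}` AS `{new}`")
    return exprs

for expr in alias_exprs(["Number_of_data_samples_(input)", "id"]):
    print(expr)

# With a DataFrame: df = df.selectExpr(*alias_exprs(df.columns))
```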