Remove last character from pyspark df columns

Question

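I am reading a csv file with pyspark like this:

df = spark.read.format('csv').options(header=True, encoding='windows-1251', delimiter=';').load('csv_file.csv')

In the resulting columns I get strings with a trailing single quote character ('), like this: 12435'

No line in the file ends with a quote, and I don't know where Spark finds it.

I need to remove this quote.

By the way, pandas reads the csv without a quote at the end of each line, but I can't convert the pd.DF to a spark.DF; I get the error cannot merge type DoubleType and StringType.

The DF has some empty columns.

I tried: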

from pyspark.sql.functions import *
for i in df.columns:
    df.withColumn(i, expr("substring({name}, 1, length({name}) -1)".format(name=i)))
 
 
for i in df.columns:
    df.withColumn(i, col(i).substr(lit(0), length(col(i)) - 1))

Neither of these helped.

The df as read:

col1 | col2
12345' abcde'

Expected output:

col1 | col2
12345  abcde

Answer 1

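Use a list comprehension with regexp_replace. Note that this pattern removes every single quote in each column, not only a trailing one.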

from pyspark.sql import functions as F
df.select(*[F.regexp_replace(F.col(c), "'", '').alias(c) for c in df.columns]).show()

+-----+-----+
| col1| col2|
+-----+-----+
|12345|abcde|
+-----+-----+

python

apache-spark

pyspark