How to convert all int dtypes to double simultaneously on PySpark
This is my dataset:
DataFrame[column1: double, column2: double, column3: int, column4: int, column5: int, ... , column300: int]
What I want is:
DataFrame[column1: double, column2: double, column3: double, column4: double, column5: double, ... , column300: double]
What I did:
dataset.withColumn("column3", dataset.column3.cast(DoubleType()))
This is too manual. Can you tell me how to do it for all the int columns at once?
You first need to filter out the int column types from the available schema. Then, in combination with reduce, you can iterate over the DataFrame and cast them to the type of your choice.
reduce is a very important and useful function that can generally be applied to any iterative use case in Spark.
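As a quick illustration of how reduce folds a sequence into a single value (a minimal sketch using plain integers rather than a DataFrame):
from functools import reduce

# reduce(f, sequence, initial) folds left to right:
# here f(f(f(0, 1), 2), 3) == 6
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  ## 6
In the DataFrame version further below, the accumulator is the DataFrame itself, and each step returns a new DataFrame with one more column cast.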
Data Preparation
import pandas as pd
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()  # `sql` is the SparkSession handle

df = pd.DataFrame({
    'id': [f'id{i}' for i in range(0, 10)],
    'col1': [i for i in range(80, 90)],
    'col2': [i for i in range(5, 15)],
    'col3': [6, 7, 5, 3, 4, 2, 9, 12, 4, 10],
})
sparkDF = sql.createDataFrame(df)
sparkDF.printSchema()
root
|-- id: string (nullable = true)
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- col3: long (nullable = true)
Identification
sparkDF.dtypes
## [('id', 'string'), ('col1', 'bigint'), ('col2', 'bigint'), ('col3', 'bigint')]
long_double_list = [ col for col,dtyp in sparkDF.dtypes if dtyp == 'bigint' ]
long_double_list
## ['col1', 'col2', 'col3']
Reduce
# Successively cast each bigint column to double, threading the
# DataFrame through reduce as the accumulator
sparkDF = reduce(
    lambda df, c: df.withColumn(c, F.col(c).cast(DoubleType())),
    long_double_list,
    sparkDF,
)
sparkDF.printSchema()
root
|-- id: string (nullable = true)
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
You can use a list comprehension to build the list of converted fields.
import pyspark.sql.functions as F
...
cols = [
    F.col(name).cast('double') if dtype == 'int' else F.col(name)
    for name, dtype in df.dtypes
]
df = df.select(cols)
df.printSchema()
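If you are on Spark 3.3 or later, DataFrame.withColumns accepts a dict mapping column names to expressions, so all the casts can be applied in a single call. A sketch, assuming a DataFrame df whose int columns should become double:
import pyspark.sql.functions as F

# Cast every int column to double in one pass (requires Spark >= 3.3)
int_cols = [name for name, dtype in df.dtypes if dtype == 'int']
df = df.withColumns({c: F.col(c).cast('double') for c in int_cols})
df.printSchema()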