Populate a pyspark dataframe with DATE sample data
I am trying to create and populate a pyspark DataFrame with date values.
Columns = ["EmployeeNo", "Name", "EmployeeID", "ValidFrom", "ValidTo"]
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),
]
DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.show()
This gives:
+----------+----------------+----------+----------+----------+
|EmployeeNo|            Name|EmployeeID| ValidFrom|   ValidTo|
+----------+----------------+----------+----------+----------+
|       100|   Hilmar Buchta|        HB|2000-01-01|2999-12-31|
+----------+----------------+----------+----------+----------+
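This looks correct, but the ValidFrom and ValidTo values are strings, not dates. How can I populate the df columns with date-typed values in one step?
I searched Stack Overflow for a while and tried this: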
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType, BooleanType
Schema = StructType([
    StructField('EmployeeNo', IntegerType(), False),
    StructField('Name', StringType(), False),
    StructField('EmployeeID', StringType(), False),
    StructField('ValidFrom', DateType(), False),
    StructField('ValidTo', DateType(), False)
])
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),]
# Pass the schema (not the column names) so the DateType fields are enforced
DfEmployee = spark.createDataFrame(Data, Schema)
which gives:
TypeError: field ValidFrom: DateType can not accept object '2000-01-01' in type <class 'str'>
So I tried:
Data = [(100, "Hilmar Buchta", "HB", F.to_date("2000-01-01", "yyyy-MM-dd"), F.to_date("2999-12-31", "yyyy-MM-dd")),]
DfEmployee = spark.createDataFrame(Data, Schema)
which gives:
TypeError: field ValidFrom: DateType can not accept object Column<b"to_date(2000-01-01, 'yyyy-MM-dd')"> in type <class 'pyspark.sql.column.Column'>
You can pass Python datetime.date objects instead of strings:
import datetime
Data = [
    (100, "Hilmar Buchta", "HB", datetime.date(2000, 1, 1), datetime.date(2999, 12, 31)),
]
DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.printSchema()
#root
# |-- EmployeeNo: long (nullable = true)
# |-- Name: string (nullable = true)
# |-- EmployeeID: string (nullable = true)
# |-- ValidFrom: date (nullable = true)
# |-- ValidTo: date (nullable = true)
Or convert the strings to Python date objects:
from datetime import datetime
# Note: %m is the month; %M would be parsed as minutes
Data = [
    (100, "Hilmar Buchta", "HB",
     datetime.strptime("2000-01-01", "%Y-%m-%d").date(),
     datetime.strptime("2999-12-31", "%Y-%m-%d").date()),
]
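If the sample data has many rows, converting each string by hand gets tedious. One possible way is a small helper that converts the date positions of every row before the data is handed to Spark. This is only a sketch: `parse_dates` and `raw_data` are hypothetical names, and it assumes the strings are in ISO format (YYYY-MM-DD) so that `date.fromisoformat` (Python 3.7+) can parse them.

```python
from datetime import date


def parse_dates(row, date_positions=(3, 4)):
    """Return a copy of `row` with the ISO strings at `date_positions` converted to date objects."""
    return tuple(
        date.fromisoformat(value) if index in date_positions else value
        for index, value in enumerate(row)
    )


raw_data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31")]

# Every row now carries datetime.date objects in the ValidFrom / ValidTo positions.
data = [parse_dates(row) for row in raw_data]
print(data)
```

The resulting `data` list can then be passed to spark.createDataFrame in place of the string-based one.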
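F.to_date builds a Column expression, so it only works on the columns of an existing DataFrame, not on plain Python values in the row data. You can convert the strings to dates after creating the df, for example: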
df = df.withColumn("ValidFrom", F.to_date("ValidFrom", "yyyy-MM-dd"))