Pyspark Split Function for Hours
Schema of the dataset (printSchema output):
root
|-- address: string (nullable = true)
|-- attributes: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- business_id: string (nullable = true)
|-- categories: string (nullable = true)
|-- city: string (nullable = true)
|-- hours: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- is_open: long (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- name: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- review_count: long (nullable = true)
|-- stars: double (nullable = true)
|-- state: string (nullable = true)
I am currently working with the Yelp dataset, and my objective is to find the total number of hours a business is open per day/week. From the data, I am able to extract a time range for a given day that looks like [9:0, 0:0]. How can I use PySpark to get two columns, one showing the opening time [9:0] and one showing the closing time [0:0]?
This is some code I used to simply display the business hours in the dataset.
import pyspark.sql.functions as f

df_hours = df_MappedBusiness.select(
    "business_id",
    "name",
    f.explode("hours").alias("hourDay", "hourValue"),
    # Note: referencing the "hourValue" alias within the same select
    # (a lateral column alias) only resolves on Spark 3.4+.
    f.split("hourValue", "[-]").alias("split_hours")
).show(50, truncate=False)
Expected Output
---------------
+--------+---------+-----------+----------+-----------+
|hourDay |hourValue|split_hours|open_hours|close_hours|
+--------+---------+-----------+----------+-----------+
|Monday  |9:0-0:0  |[9:0, 0:0] |[9,0]     |[0,0]      |
+--------+---------+-----------+----------+-----------+
Calling pyspark.sql.functions.split creates a Column of ArrayType (which in turn holds strings). To access the elements of that nested column, you use the same syntax as for lists or even Pandas DataFrames, i.e. split(some_column, some_character)[some_index].
Example:
from pyspark.sql.functions import split

df = spark.createDataFrame(
    (("shop", "Monday", "9:0-0:0"),
     ("shop", "Tuesday", "12:30-21:30")),
    schema=("shopname", "day_of_week", "opening_hours"))

(df
 .withColumn("opens", split(df.opening_hours, "-")[0])
 .withColumn("closes", split(df.opening_hours, "-")[1])
 .show()
)
+--------+-----------+-------------+-----+------+
|shopname|day_of_week|opening_hours|opens|closes|
+--------+-----------+-------------+-----+------+
| shop| Monday| 9:0-0:0| 9:0| 0:0|
| shop| Tuesday| 12:30-21:30|12:30| 21:30|
+--------+-----------+-------------+-----+------+
Note that your approach leaves you with two columns of StringType() (the last two columns I added here). You could convert these to numbers (minutes since midnight, say), but then you have to watch for negative differences, because a business that "closes at 0:0" really closes at midnight, after it opened, so a naive closes minus opens would come out negative. Anyway, I'll leave that as a challenge.
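For reference, here is a minimal sketch of that conversion, assuming the opens/closes string columns from the example above; the 24-hour wrap rule for past-midnight closing times is my own assumption:

from pyspark.sql import functions as F

# Minutes since midnight for an "H:M" string column, e.g. "9:0" -> 540.
def minutes(col_name):
    parts = F.split(F.col(col_name), ":")
    return parts[0].cast("int") * 60 + parts[1].cast("int")

df_minutes = (df
    .withColumn("opens_min", minutes("opens"))
    .withColumn("closes_min", minutes("closes"))
    # Assumption: a closing time at or before the opening time (e.g. "0:0")
    # means the shop closes at or past midnight, so wrap it forward by 24h.
    .withColumn(
        "open_minutes",
        F.when(F.col("closes_min") <= F.col("opens_min"),
               F.col("closes_min") + 24 * 60 - F.col("opens_min"))
         .otherwise(F.col("closes_min") - F.col("opens_min"))))

# Total hours open per week, per shop.
(df_minutes
 .groupBy("shopname")
 .agg((F.sum("open_minutes") / 60).alias("hours_per_week"))
 .show())

With the sample data above, Monday's 9:0-0:0 yields 900 minutes (15 hours) and Tuesday's 12:30-21:30 yields 540 minutes (9 hours).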
Here is the code for this question. I looked up the Yelp dataset online and applied the solution to it.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import col, when, lit
from functools import reduce

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

df1 = spark.read.json(r"your_data_path")

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# One DataFrame per day: split "9:0-0:0" into open/close hours,
# leaving nulls for days the business has no recorded hours.
day_frames = [
    df1.select(
        "business_id",
        "name",
        lit(day).alias("hourday"),
        when(col(f"hours.{day}").isNotNull(),
             f.split(f"hours.{day}", "-")[0]).alias("OpenHours"),
        when(col(f"hours.{day}").isNotNull(),
             f.split(f"hours.{day}", "-")[1]).alias("CloseHours"),
    )
    for day in days
]

# Stack the seven per-day DataFrames into one (union matches by position).
df_final = reduce(lambda left, right: left.union(right), day_frames)
df_final.show(10, False)
Let me know if you have any questions about this.
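As a footnote, the explode-based attempt from the question can also be made to work on any Spark version by doing the split in a separate step, rather than referencing the explode alias inside the same select; this is just a sketch combining the two approaches above:

import pyspark.sql.functions as f

# Exploding a map column yields one row per (key, value) entry;
# the two-name alias labels the map's key and value columns.
df_hours = (df1
    .select("business_id", "name",
            f.explode("hours").alias("hourDay", "hourValue"))
    .withColumn("OpenHours", f.split("hourValue", "-")[0])
    .withColumn("CloseHours", f.split("hourValue", "-")[1]))
df_hours.show(10, False)

Unlike the per-day selects above, explode drops businesses whose hours map is null (use explode_outer to keep them), and days missing from the map simply do not appear as rows.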