PySpark split function for hours

 root
  |-- address: string (nullable = true)
  |-- attributes: map (nullable = true)
  |    |-- key: string
  |    |-- value: string (valueContainsNull = true)
  |-- business_id: string (nullable = true)
  |-- categories: string (nullable = true)
  |-- city: string (nullable = true)
  |-- hours: map (nullable = true)
  |    |-- key: string
  |    |-- value: string (valueContainsNull = true)
  |-- is_open: long (nullable = true)
  |-- latitude: double (nullable = true)
  |-- longitude: double (nullable = true)
  |-- name: string (nullable = true)
  |-- postal_code: string (nullable = true)
  |-- review_count: long (nullable = true)
  |-- stars: double (nullable = true)
  |-- state: string (nullable = true)

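I'm currently working with the Yelp dataset, and my objective is to find the total number of hours a business is open per day/week. From the data, I can extract a time range for a given day that looks like [9:0, 0:0]. How can I use PySpark to get two columns, one showing the opening time [9:0] and one showing the closing time [0:0]?

Here is some code I used to simply display the business hours in the dataset.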

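import pyspark.sql.functions as f

# explode() on a map yields one row per (key, value) pair. The new columns
# only exist after this select, so split() is applied in a separate step.
df_hours = (
    df_MappedBusiness
    .select("business_id", "name", f.explode("hours").alias("hourDay", "hourValue"))
    .withColumn("split_hours", f.split("hourValue", "-"))
)

# show() returns None, so call it separately rather than assigning its result.
df_hours.show(50, truncate=False)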


Expected Output
---------------

+-------+---------+-----------+----------+-----------+
|hourDay|hourValue|split_hours|open_hours|close_hours|
+-------+---------+-----------+----------+-----------+
|Monday |9:0-0:0  |[9:0, 0:0] |[9, 0]    |[0, 0]     |
+-------+---------+-----------+----------+-----------+

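Calling pyspark.sql.functions.split gives you a Column of ArrayType whose elements are strings. To access an element of that nested column, use the same indexing syntax you would use on a list or even a Pandas DataFrame, i.e. split(some_column, some_character)[some_index].

Example: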

from pyspark.sql.functions import split

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    [("shop", "Monday", "9:0-0:0"),
     ("shop", "Tuesday", "12:30-21:30")],
    schema=["shopname", "day_of_week", "opening_hours"])

# split() returns an array column; index it to pick out each part.
(df
 .withColumn("opens", split(df.opening_hours, "-")[0])
 .withColumn("closes", split(df.opening_hours, "-")[1])
 .show()
 )

+--------+-----------+-------------+-----+------+
|shopname|day_of_week|opening_hours|opens|closes|
+--------+-----------+-------------+-----+------+
|    shop|     Monday|      9:0-0:0|  9:0|   0:0|
|    shop|    Tuesday|  12:30-21:30|12:30| 21:30|
+--------+-----------+-------------+-----+------+

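Note that this approach leaves you with two StringType() columns (the last two I added here). You will probably want to convert them to numbers (minutes since midnight, for example), but then you need to watch out for negative durations: a closing time of 0:0 means midnight at the end of the day, so closing minus opening comes out negative. In any case, I'll leave that as a challenge.

Here is code for this problem. I looked up the Yelp dataset online and applied the solution to it.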
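
The minutes-since-midnight arithmetic can be sketched in plain Python before porting it to Spark. The helper names `to_minutes` and `minutes_open` are hypothetical. Treating a closing time at or before the opening time as "closes the next day" is an assumption, so a range such as 0:0-0:0 counts as a full 24 hours, which may not match how the dataset marks closed days.

```python
def to_minutes(clock):
    """Convert an 'H:M' string such as '9:0' or '21:30' to minutes since midnight."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def minutes_open(opening_hours):
    """Minutes between opening and closing for a range such as '9:0-0:0'.

    Assumption: a closing time at or before the opening time means the
    business closes on the following day, so one full day is added.
    """
    opens, closes = (to_minutes(part) for part in opening_hours.split("-"))
    duration = closes - opens
    if duration <= 0:
        duration += 24 * 60  # wrap past midnight
    return duration


print(minutes_open("9:0-0:0"))      # 900
print(minutes_open("12:30-21:30"))  # 540
```

The same logic could be wrapped in a UDF or rewritten with column expressions on the two split parts; summing the result per business would then give the weekly total.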

from functools import reduce

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

df1 = spark.read.json(r"your_data_path")

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hours_for(day):
    # One row per business for the given day; OpenHours and CloseHours
    # are null when the business has no hours listed for that day.
    hours = f.col("hours." + day)
    parts = f.split(hours, "-")
    return df1.select(
        "business_id", "name",
        f.lit(day).alias("hourday"),
        f.when(hours.isNotNull(), parts[0]).alias("OpenHours"),
        f.when(hours.isNotNull(), parts[1]).alias("CloseHours"))

# Stack the seven per-day DataFrames into one (union replaces the deprecated unionAll).
df_final = reduce(lambda a, b: a.union(b), [hours_for(day) for day in days])

df_final.show(10, False)

Let me know if you have any questions about this.