SparkSQL 查询数据框
SparkSQL query dataframe
我将 pandas 数据帧转换为 spark sql table。我是 SQL 的新手,想 select 来自 table 的密钥 'code'。
查询
sqlContext.sql("""SELECT `classification` FROM psyc""").show()
查询响应
+--------------------+
| classification|
+--------------------+
|[{'code': '3297',...|
|[{'code': '3410',...|
|[{'code': '3410',...|
|[{'code': '2227',...|
|[{'code': '3410',...|
+--------------------+
我怎样才能select 密钥'code'。该列包含包含数据的字典列表。
sqlContext.sql("""SELECT `classification.code` FROM psyc""").show() # this query does not work
这是剩余的代码
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession \
.builder \
.appName("Python Spark SQL ") \
.getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
fp = os.path.join(BASE_DIR,'psyc.csv')
df = spark.read.csv(fp,header=True)
df.printSchema()
df.createOrReplaceTempView("psyc")
这将创建具有以下架构的 table
试试这个
df.select(F.explode("classification").alias("classification")).select("classification.code").show()
字段classification
是string类型,首先要转成struct类型,然后直接select为classification.code
。要从字符串转换为结构,请尝试以下操作。
//Sample Dataframe
from pyspark.sql.types import *
df=spark.createDataFrame([(1,"[{'code':'1234','name':'manoj'},{'code':'124','name':'kumar'},{'code':'4567','name':'dhakad'}]",),(2,"[{'code':'97248','name':'joe'},{'code':'2424','name':'alice'},{'code':'464','name':'bob'}]",)],["id","classification",])
//df will be below
+---+--------------------+
| id| classification|
+---+--------------------+
| 1|[{'code':'1234','...|
| 2|[{'code':'97248',...|
+---+--------------------+
//here is schema of above df
root
|-- id: long (nullable = true)
|-- classification: string (nullable = true)
//df after converting classification column to the struct type and selecting only code.
schema = ArrayType(StructType([StructField('code', StringType()), StructField('name', StringType())]))
df1=df.withColumn('classification',from_json(col("classification"),schema=schema))
df2=df1.withColumn("code",col("classification.code"))
+---+--------------------+------------------+
| id| classification| code|
+---+--------------------+------------------+
| 1|[[1234,manoj], [1...| [1234, 124, 4567]|
| 2|[[97248,joe], [24...|[97248, 2424, 464]|
+---+--------------------+------------------+
//Here, I am going to select id and while exploding code column
df3=df2.select(col("id"),explode(col("code")))
df3.show()
//df3 output
+---+-----+
| id| col|
+---+-----+
| 1| 1234|
| 1| 124|
| 1| 4567|
| 2|97248|
| 2| 2424|
| 2| 464|
+---+-----+
我将 pandas 数据帧转换为 spark sql table。我是 SQL 的新手,想 select 来自 table 的密钥 'code'。
查询
sqlContext.sql("""SELECT `classification` FROM psyc""").show()
查询响应
+--------------------+
| classification|
+--------------------+
|[{'code': '3297',...|
|[{'code': '3410',...|
|[{'code': '3410',...|
|[{'code': '2227',...|
|[{'code': '3410',...|
+--------------------+
我怎样才能select 密钥'code'。该列包含包含数据的字典列表。
sqlContext.sql("""SELECT `classification.code` FROM psyc""").show() # this query does not work
这是剩余的代码
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession \
.builder \
.appName("Python Spark SQL ") \
.getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
fp = os.path.join(BASE_DIR,'psyc.csv')
df = spark.read.csv(fp,header=True)
df.printSchema()
df.createOrReplaceTempView("psyc")
这将创建具有以下架构的 table
试试这个
df.select(F.explode("classification").alias("classification")).select("classification.code").show()
字段classification
是string类型,首先要转成struct类型,然后直接select为classification.code
。要从字符串转换为结构,请尝试以下操作。
//Sample Dataframe
from pyspark.sql.types import *
df=spark.createDataFrame([(1,"[{'code':'1234','name':'manoj'},{'code':'124','name':'kumar'},{'code':'4567','name':'dhakad'}]",),(2,"[{'code':'97248','name':'joe'},{'code':'2424','name':'alice'},{'code':'464','name':'bob'}]",)],["id","classification",])
//df will be below
+---+--------------------+
| id| classification|
+---+--------------------+
| 1|[{'code':'1234','...|
| 2|[{'code':'97248',...|
+---+--------------------+
//here is schema of above df
root
|-- id: long (nullable = true)
|-- classification: string (nullable = true)
//df after converting classification column to the struct type and selecting only code.
schema = ArrayType(StructType([StructField('code', StringType()), StructField('name', StringType())]))
df1=df.withColumn('classification',from_json(col("classification"),schema=schema))
df2=df1.withColumn("code",col("classification.code"))
+---+--------------------+------------------+
| id| classification| code|
+---+--------------------+------------------+
| 1|[[1234,manoj], [1...| [1234, 124, 4567]|
| 2|[[97248,joe], [24...|[97248, 2424, 464]|
+---+--------------------+------------------+
//Here, I am going to select id and while exploding code column
df3=df2.select(col("id"),explode(col("code")))
df3.show()
//df3 output
+---+-----+
| id| col|
+---+-----+
| 1| 1234|
| 1| 124|
| 1| 4567|
| 2|97248|
| 2| 2424|
| 2| 464|
+---+-----+