Convert a dataframe containing a list of dictionaries into multiple rows in PySpark
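I have the following problem: I have a dataframe with two columns that each contain a list of dictionaries. The schema I created for my data structure is as follows: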
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

tick_by_tick_schema = StructType([
    StructField('localSymbol', StringType()),
    StructField('tickByTicks', ArrayType(StructType([
        StructField('price', StringType()),
        StructField('size', StringType()),
        StructField('specialConditions', StringType()),
    ]))),
    StructField('domBids', ArrayType(StructType([
        StructField('price_bid', StringType()),
        StructField('size_bid', StringType()),
        StructField('marketMaker_bid', StringType()),
    ])))
])
My dataframe looks like this:
+-----------+----------------+----------------------------------------------------------------------------------------+
|localSymbol|tickByTicks |domBids |
+-----------+----------------+----------------------------------------------------------------------------------------+
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|
+-----------+----------------+----------------------------------------------------------------------------------------+
What I want to get is this:
+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+
|localSymbol|tickByTicks |domBids |price_bid|marketMaker_bid|price|
+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8 |CHX |32.99|
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8 |MEMX |32.99|
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8 |NYSENAT |32.99|
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.79 |NSDQ |32.99|
|ALKT |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.69 |BYX |32.99|
+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+
I tried this, but it obviously doesn't work xD
df = self.tick_by_tick_data_processed.select(f.col('localSymbol'), f.col('tickByTicks'), f.col('domBids')) \
    .withColumn('price_bid', f.explode(f.col('tickByTicks.price'))) \
    .withColumn('marketMaker_bid', f.explode(f.col('domBids.marketMaker_bid'))) \
    .withColumn('price_bid', f.explode(f.col('domBids.price_bid')))
This might work:
# Assumes `from pyspark.sql import functions as f`.
# Explode both arrays, then select every field of both structs.
df = self.tick_by_tick_data_processed \
    .withColumn('tick', f.explode(f.col('tickByTicks'))) \
    .withColumn('dom', f.explode(f.col('domBids'))) \
    .select('localSymbol', 'tick.*', 'dom.*')

# Or select only the desired columns.
df = self.tick_by_tick_data_processed \
    .withColumn('tick', f.explode(f.col('tickByTicks'))) \
    .withColumn('dom', f.explode(f.col('domBids'))) \
    .select('localSymbol',
            f.col('tick.price').alias('tick_price'),
            f.col('dom.marketMaker_bid').alias('marketMaker_bid'),
            f.col('dom.price_bid').alias('price_bid'))
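To see why exploding the whole struct gives 5 rows while the attempt in the question does not, here is a rough plain-Python model of the row counts (no Spark needed). The `explode` helper below is hypothetical and only imitates how each explode multiplies rows; it is not the PySpark function, and the sample row is copied from the question.

```python
# Plain-Python model of the rows produced by chained explodes.
row = {
    "localSymbol": "ALKT",
    "tickByTicks": [{"price": "32.99", "size": "100", "specialConditions": ""}],
    "domBids": [
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "CHX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "MEMX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "NYSENAT"},
        {"price_bid": "32.79", "size_bid": "1", "marketMaker_bid": "NSDQ"},
        {"price_bid": "32.69", "size_bid": "1", "marketMaker_bid": "BYX"},
    ],
}


def explode(rows, source, target):
    """Hypothetical helper: emit one output row per element returned by source(row)."""
    return [{**r, target: item} for r in rows for item in source(r)]


# Exploding each struct array once keeps every bid's fields together.
per_struct = explode([row], lambda r: r["tickByTicks"], "tick")
per_struct = explode(per_struct, lambda r: r["domBids"], "dom")

# Exploding two fields of the same array separately pairs every
# market maker with every price, which is what the attempt does.
per_field = explode(
    [row], lambda r: [d["marketMaker_bid"] for d in r["domBids"]], "marketMaker_bid"
)
per_field = explode(
    per_field, lambda r: [d["price_bid"] for d in r["domBids"]], "price_bid"
)

result = [
    (r["dom"]["price_bid"], r["dom"]["marketMaker_bid"], r["tick"]["price"])
    for r in per_struct
]
print(len(per_struct), len(per_field))  # prints: 5 25
```

Note that chaining two explodes still yields a cross product of the two arrays, so a row with several ticks and several bids produces (number of ticks) x (number of bids) rows. Also, `explode` in PySpark drops rows whose array is null or empty; `explode_outer` keeps them with nulls if that matters for your data.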