How to get all data from a message segment with Databricks Labs Smolder
For example, I know you can retrieve a sub-field from a message segment using the provided helper function:
val nameDf = df.select(segment_field("PID", 4).alias("name"))
Is there a way to extract the whole message segment ("PID") without having to add an index for each sub-field?
PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA
segment_field, as implemented in Smolder, is a helper method that extracts a single value at a given index:
/**
 * Extracts a field from a message segment.
 *
 * @param segment The ID of the segment to extract.
 * @param field The index of the field to extract.
 * @param segmentColumn The name of the column containing message segments.
 *   Defaults to "segments".
 * @return Yields a new column containing the field of a message segment.
 *
 * @note If there are multiple segments with the same ID, this function will
 *   select the field from one of the segments. Order is undefined.
 */
def segment_field(segment: String,
                  field: Int,
                  segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
    .getItem(field)
}
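To make the index concrete, here is a minimal plain-Scala sketch (no Spark) that mirrors the same steps on an in-memory collection: find the segment by ID, take its fields, pick one by position. The `Segment` case class, `parseSegment`, and `segmentField` are hypothetical names for illustration only, not part of Smolder. It assumes the segment ID is the first `|`-delimited token and that the remaining tokens are the zero-indexed fields, which matches index 4 being the name in the question's sample.

```scala
object SegmentFieldPlainSketch {
  // Hypothetical in-memory stand-in for one parsed segment (not Smolder's own type).
  final case class Segment(id: String, fields: Vector[String])

  // Assumption: the first '|'-delimited token is the segment ID and the rest are
  // the fields. The -1 limit keeps empty fields so positions are not shifted.
  def parseSegment(line: String): Segment = {
    val tokens = line.split("\\|", -1).toVector
    Segment(tokens.head, tokens.tail)
  }

  // Mirrors segment_field: match on ID, take one segment, index into its fields.
  // Here the first match is used; the Spark helper leaves the order undefined.
  def segmentField(segments: Seq[Segment], id: String, field: Int): Option[String] =
    segments.find(_.id == id).flatMap(_.fields.lift(field))

  def main(args: Array[String]): Unit = {
    val pid = parseSegment(
      "PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||" +
        "140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA")

    println(segmentField(Seq(pid), "PID", 4)) // prints Some(Heller^Keneth)
    println(pid.fields.length)                // number of fields after the ID
  }
}
```

Returning an `Option` here is a design choice for the sketch: an out-of-range index or a missing segment yields `None` instead of an exception, which is roughly analogous to the null a Spark column expression would produce.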
However, if you want to extract all field values regardless of index, you can replicate this behavior as:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def segment_fields(segment: String,
                   segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
}
and use it like this:
val nameDf = df.select(segment_fields("PID").alias("values"))
You can then extract or transform the data as needed.
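As one possible follow-up, the sketch below turns the returned array into one row per field and splits each field into its `^`-separated components. It is only an illustration under stated assumptions: the `Segment` and `Message` case classes and the `pidComponents` helper are hypothetical, and the test DataFrame is built by hand to mimic a `segments` column of type `array<struct<id: string, fields: array<string>>>` rather than read through Smolder. It needs Spark 3.0 or later for the Scala `filter` higher-order function.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SegmentFieldsSketch {
  // Hypothetical case classes mirroring the assumed column layout:
  // segments: array<struct<id: string, fields: array<string>>>
  case class Segment(id: String, fields: Seq[String])
  case class Message(message: String, segments: Seq[Segment])

  // Same helper as in the answer: all fields of one segment as an array column.
  def segment_fields(segment: String,
                     segmentColumn: Column = col("segments")): Column =
    filter(segmentColumn, s => s("id") === lit(segment))
      .getItem(0)
      .getField("fields")

  // One row per PID field: its position, its raw value, and the value split on '^'.
  def pidComponents(df: DataFrame): DataFrame =
    df.select(posexplode(segment_fields("PID")).as(Seq("index", "value")))
      .withColumn("components", split(col("value"), "\\^"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("segment-fields-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hand-built stand-in for a parsed message; the -1 limit keeps empty fields.
    val tokens = ("PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||" +
      "140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA").split("\\|", -1)
    val df = Seq(Message("sample", Seq(Segment(tokens.head, tokens.tail.toSeq)))).toDF()

    // Show only the populated fields.
    pidComponents(df).filter(col("value") =!= "").show(truncate = false)

    spark.stop()
  }
}
```

Using `posexplode` keeps the original position next to each value, so individual fields can still be selected by index later without hard-coding every position up front.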