使 Spark 代码更高效、更简洁
Make a Spark code more efficient and cleaner
我有以下代码可以清理 returns 具有两列的数据框的文档语料库 (pipelineClean(corpus)
):
- “id”:长
- “令牌”:数组[字符串]。
之后,代码生成一个包含以下列的数据框:
- “术语”:字符串
- "postingList": List[Array[Long, Long]](第一个 long 是文档中记录的另一个术语频率)
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
我的问题是:是否可以让这段代码更短 and/or 更有效率?例如,是否可以在单个 groupBy
?
中进行分组
除了跳过 withColumn
调用并直接使用 select:
之外,您似乎无能为力。
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
而不是
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
我有以下代码可以清理 returns 具有两列的数据框的文档语料库 (pipelineClean(corpus)
):
- “id”:长
- “令牌”:数组[字符串]。
之后,代码生成一个包含以下列的数据框:
- “术语”:字符串
- "postingList": List[Array[Long, Long]](第一个 long 是文档中记录的另一个术语频率)
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
我的问题是:是否可以让这段代码更短 and/or 更有效率?例如,是否可以在单个 groupBy
?
除了跳过 withColumn
调用并直接使用 select:
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
而不是
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")