How to add a column to a SparkDataFrame using a set of SQL expressions?
I am using SparkR and want to add a column to a SparkDataFrame based on a string transformation of an existing column. Consider the following SparkDataFrame:
head(df)
id address
1 street_X, postal_code_X, neighborhood_X, county_name_X
2 neighborhood_Y, county_name_Y
3 postal_code_Z, neighborhood_Z, county_name_Z
I need to add a column containing only the neighborhood. I managed to extract this column into a new SparkDataFrame as follows:
new_df <- selectExpr(df, "SUBSTRING_INDEX(address, ',', -2) AS neighborhood")
new_df <- selectExpr(new_df, "SUBSTRING_INDEX(neighborhood, ',', 1) AS neighborhood")
head(new_df)
neighborhood
neighborhood_X
neighborhood_Y
neighborhood_Z
But how do I add this neighborhood column to the original df (the equivalent of cbind in R)? I looked at withColumn, but did not manage to combine it with selectExpr.
Try something like this, which simply selects the other columns alongside the expression:
new_df <- selectExpr(df, "id", "address",
"SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1) AS neighborhood")
This is also possible:
new_df <- selectExpr(df, "*",
"SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1) AS neighborhood")
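Since the question mentions withColumn: SparkR's withColumn() takes a Column object rather than a SQL string, but expr() converts a SQL expression string into a Column, so the two can be combined. A minimal sketch, assuming a running Spark session and the df from the question:

```r
library(SparkR)

# expr() turns the SQL string into a Column, which withColumn()
# appends to df under the name "neighborhood":
new_df <- withColumn(
  df, "neighborhood",
  expr("SUBSTRING_INDEX(SUBSTRING_INDEX(address, ',', -2), ',', 1)")
)
head(new_df)
```

This keeps all original columns of df and adds the derived one, equivalent to the `selectExpr(df, "*", ...)` form above.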