Combining Spark Data Frames with Different Numbers of Columns
In a question I asked how to combine PySpark data frames that have different numbers of columns. The answer given requires every data frame to have the same set of columns before they can all be combined:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder\
    .appName("DynamicFrame")\
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

dataframes = [df01, df02, df03]

# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)

# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
    dfs[new_name] = d

    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted

# Now put it all together with a loop (union)
result = dfs['df0']  # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())  # List of all the dataframes in the dictionary
dfs_to_add.remove('df0')  # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()
Is there any way to merge PySpark data frames without first making sure that they all have the same number of columns? I ask because combining about 100 data frames takes roughly 2 days, and with the code above the process times out.
df = df1.unionByName(df2, allowMissingColumns=True)
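In Spark 3.1 and later, unionByName accepts allowMissingColumns=True, which matches columns by name and fills any column missing from one side with null. As a minimal sketch of how the whole list could be folded together this way (it reuses df01, df02, df03 from the snippet above; with the real workload the list would hold the ~100 data frames):

from functools import reduce

# Sketch only: `dataframes` is assumed to hold all the frames to combine.
dataframes = [df01, df02, df03]

# unionByName matches columns by name; allowMissingColumns=True (Spark 3.1+)
# fills columns missing from either side with null instead of 0.
result = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes,
)
result.show()

This removes the column-alignment preprocessing (withColumn/select) entirely. Note that missing values come back as null rather than 0; add a fillna(0) step if the zero-fill behaviour of the original code is required.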