Pyspark code error: Invalid argument, not a string or column
I am running this code, and I only want to return some of the columns, not every column from all the tables involved in the joins.
df_final = df.join(df1, (df['sbr_brand'] == df1['sbr_brand'])
                        & (df['sbr_number'] == df1['sbr_number'])
                        & (df['calendar_date'] == df1['calendar_date'])
                        & (df['check_number'] == df1['check_number'])) \
             .join(df2, (df['sbr_brand'] == df2['brand'])
                        & (df['sbr_number'] == df2['store_number'])
                        & (df['calendar_date'] == df2['date_of_business'])
                        & (df['check_number'] == df2['check_number']), 'inner') \
             .select(df['modifier_gross_amount'],
                     df1['check_line_number', 'item_barcode', 'dining_option', 'item_quantity', 'item_gross_amount', 'item_net_amount'],
                     df2['brand_id'])
I get this error:
Invalid argument, not a string or column: DataFrame[check_line_number: bigint, item_barcode: string, dining_option: string, item_quantity: double, item_gross_amount: decimal(38,6), item_net_amount: decimal(38,6)] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
If I remove the select statement at the bottom, the code runs perfectly. Then when I run the command below, it shows all the columns from all 3 DataFrames.
display(df_final)
I also ran the select as a separate command to see whether it made any difference:
df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])
But I get the same error. Not sure how to fix this. Please advise.
Try the following -
Sample input DataFrames
df1 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
df2 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
Output (using join)
df1.join(df2, (df1["id1"] == df2["id1"]) & (df1["id2"] == df2["id2"]) & (df1["value"] == df2["value"])).select(df1["id1"], df2["id2"], df1["value"]).show()
+---+---+-----+
|id1|id2|value|
+---+---+-----+
| 1| 1| 3|
| 1| 2| 1|
| 1| 2| 3|
| 2| 1| 1|
| 2| 2| 3|
+---+---+-----+
df['col1'] returns a Column, whereas df['col1', 'col2'] returns a DataFrame. The arguments to select must be strings or Columns, so it should be:
df_final2 = df_final.select(df['modifier_gross_amount'],
                            df1['check_line_number'], df1['item_barcode'], df1['dining_option'],
                            df1['item_quantity'], df1['item_gross_amount'], df1['item_net_amount'],
                            df2['brand_id'])