Delete duplicates using Spark SQL

The following code runs fine in Databricks Spark SQL:

with CTE1 as
(
 select *,
        row_number() over (partition by ID order by Name) as r
 from Emp
)
select * from CTE1 where r > 1

But for the DELETE statement:

with CTE1 as
(
 select *,
        row_number() over (partition by ID order by Name) as r
 from Emp
)
DELETE from CTE1 where r > 1

the SQL statement throws an error:

Analysis exception: Table Not found Emp

The syntax you want is only available in SQL Server. Assuming Name is unique and not NULL, you can use an alternative such as:

delete from emp
    where name > (select min(emp2.name)
                  from emp emp2
                  where emp2.id = emp.id
                 );

Otherwise, use the table's primary key for the comparison.
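
For example, with a hypothetical unique key column emp_key (not shown in the original schema), the same pattern keeps the lowest-keyed row per ID. Note that on Databricks, DELETE statements only run against Delta tables:

-- Sketch only: emp_key stands in for whatever unique key emp actually has.
delete from emp
    where emp_key > (select min(emp2.emp_key)
                     from emp emp2
                     where emp2.id = emp.id
                    );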

Here is a simpler approach: instead of deleting, just select the rows you want:

with CTE1 as
(
 select *,
        row_number() over (partition by ID order by Name) as r
 from Emp
)
select * from CTE1 where r = 1
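
If you need the deduplication to persist rather than just filtering on read, one workaround is to materialize the filtered rows into a new table and swap it in. A minimal sketch, assuming Emp is a managed table; Emp_dedup and Emp_old are hypothetical scratch names, and select * except (...) requires a recent Databricks runtime (otherwise list Emp's columns explicitly):

-- Materialize only the first row per ID, dropping the helper column r.
create table Emp_dedup as
select * except (r)
from (
  select *,
         row_number() over (partition by ID order by Name) as r
  from Emp
) ranked
where r = 1;

-- Swap the deduplicated copy into place once the result is verified.
alter table Emp rename to Emp_old;
alter table Emp_dedup rename to Emp;
drop table Emp_old;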