Filter out duplicate rows based on a subset of columns
I have some data that looks like this:
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X01,2014-02-13T12:37:16,Clothes,Tshirts
X01,2014-02-13T12:38:33,Shoes,Running
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
X02,2014-02-13T12:41:04,Books,Fiction
What I'd like to do is keep one instance in time of each of these data points, like this (I don't care which instance):
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
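Unfortunately, according to the Hive Language Manual, Hive's DISTINCT applies to the whole selected row rather than to a subset of columns, so something like this is not an option: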
SELECT DISTINCT(ID, SubCategory),
DateTime,
Category
FROM sometable
How can I get the second table above? Thanks in advance!
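The usual way to do this kind of thing in SQL is with a GROUP BY, taking an aggregate of the column you don't want to deduplicate on: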
select ID, category, subcategory, min(datetime) datetime
from sometable
group by ID, category, subcategory
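To check what the grouped query returns on the sample data, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite stands in for Hive purely for illustration, and the names `SAMPLE_ROWS`, `dedupe`, and the `first_seen` alias are my own assumptions, not anything from the question.

```python
import sqlite3

# Sample rows copied from the question.
SAMPLE_ROWS = [
    ("X01", "2014-02-13T12:36:14", "Clothes", "Tshirts"),
    ("X01", "2014-02-13T12:37:16", "Clothes", "Tshirts"),
    ("X01", "2014-02-13T12:38:33", "Shoes", "Running"),
    ("X02", "2014-02-13T12:39:23", "Shoes", "Running"),
    ("X02", "2014-02-13T12:40:42", "Books", "Fiction"),
    ("X02", "2014-02-13T12:41:04", "Books", "Fiction"),
]


def dedupe(rows):
    """Keep the earliest row per (ID, Category, SubCategory) group."""
    conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for Hive
    try:
        conn.execute(
            "CREATE TABLE sometable "
            "(ID TEXT, DateTime TEXT, Category TEXT, SubCategory TEXT)"
        )
        conn.executemany("INSERT INTO sometable VALUES (?, ?, ?, ?)", rows)
        # ISO 8601 timestamps sort correctly as plain text, so MIN() picks
        # the earliest one in each group.
        query = """
            SELECT ID, MIN(DateTime) AS first_seen, Category, SubCategory
            FROM sometable
            GROUP BY ID, Category, SubCategory
            ORDER BY first_seen
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in dedupe(SAMPLE_ROWS):
        print(",".join(row))
```

On this sample the query yields four rows, one per distinct (ID, Category, SubCategory) combination. That includes X01,Shoes,Running, which the expected output in the question does not list, so it is worth checking which columns really define a duplicate for your case.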