Efficient classification of records by common letters in Impala

Question

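I have a table in Impala (TBL1) that contains different names sharing varying numbers of initial letters. The table contains about 3M records. I want to add a new attribute to the table that assigns one class to each group of common initial letters. It should work like DENSE_RANK, except that the number of initial letters is dynamic. The number of shared initial letters must be at least p = 3 (p is a parameter).
Here is a sample of the table and the desired result: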

|  ID   |   Attr1      | New_Attr1   | Some more attribute...
+-------+--------------+-------------+-----------------------
|  1    | ZXA-12       |  1          |
|  2    | YL3300       |  2          |
|  3    | ZXA-123      |  1          |
|  4    | YL3400       |  2          |
|  5    | YL3-aaa      |  2          |
|  6    | TSA 789      |  3          |

...

Answer 1

Is this what you want?

select t.*,
       dense_rank() over (order by strleft(attr1, 3)) as newcol
from . . .;

The "3" is your parameter p.

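Note: in your example, the new values appear to be assigned in reverse alphabetical order of the prefix (ZXA gets 1, YL3 gets 2, TSA gets 3). To reproduce that numbering, you need desc in the order by.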
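
For reference, the query with the descending order applied might look like the sketch below. It assumes the table is TBL1 and the column is attr1, as in the question; the alias t and the output name new_attr1 are only illustrative. It groups on a fixed prefix of p = 3 characters and does not attempt to detect longer common prefixes.

```sql
-- Sketch only: TBL1 and attr1 are taken from the question;
-- the alias t and the column name new_attr1 are illustrative.
select t.*,
       -- 3 is the parameter p (number of leading characters compared);
       -- desc makes the alphabetically last prefix receive class 1.
       dense_rank() over (order by strleft(t.attr1, 3) desc) as new_attr1
from TBL1 t;
```

On the six sample rows the prefixes are ZXA, YL3, ZXA, YL3, YL3 and TSA, so the descending order yields classes 1, 2, 1, 2, 2 and 3, which matches the desired New_Attr1 column.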
