Extract all characters before a period with HiveQL regex?

Question

I have a table that looks like:

bl.ah
foo.bar
bar.fight

and I would like to use HiveQL's regexp_extract to return:

bl
foo
bar

Answer 1

Given the documentation for regexp_extract:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
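To make the index parameter concrete, here is a small sketch that runs the documentation's own example with each group index. It assumes a Hive version that allows SELECT without a FROM clause; the column aliases are only illustrative.

```sql
-- Index 0 is the whole match; 1 and 2 are the first and second capture groups.
SELECT
  regexp_extract('foothebar', 'foo(.*?)(bar)', 0) AS whole_match,  -- 'foothebar'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1) AS group_1,      -- 'the'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 2) AS group_2;      -- 'bar'
```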

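So, if you have a table with a single column (call it description for this example), you should be able to use regexp_extract to get the data before the period if there is one, or the whole string if there is no period: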

regexp_extract(description,'^([^\.]+)\.?',1)

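The parts of the regular expression are:

^ start of the string
([^\.]+) any non-period character, one or more times, in a capture group
\.? a literal period, zero or one time (that is, optional)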

Because the part of the string we are interested in is in the first (and only) capture group, we refer to it by passing 1 as the index parameter.
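Put together as a full query, this might look like the sketch below. The table name my_table is hypothetical, and description is the assumed column name from this answer. The sketch uses a slightly simplified pattern: inside a character class a period is already literal, so [^.] needs no backslash, and dropping the trailing optional period does not change what group 1 captures.

```sql
-- my_table is a hypothetical table name; description is the assumed column.
SELECT
  description,
  -- Capture everything from the start of the string up to the first period.
  regexp_extract(description, '^([^.]+)', 1) AS before_period
FROM my_table;
```

For the sample rows in the question (bl.ah, foo.bar, bar.fight), before_period should come back as bl, foo and bar.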
