Deserializer does not exist error in Athena

Question

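I get this error when trying to create a fixed-width table. The first 7 place holders are for the first column, followed by a space, and then the second column starting from the 9th position.

I get this error: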

HIVE_SERDE_NOT_FOUND: deserializer does not exist: org.apache.hadoop.hive.contrib.serde2.RegexSerDe

The CREATE TABLE statement itself does not fail, but the error above appears when I select data from the table.

CREATE EXTERNAL TABLE IF NOT EXISTS hunspell.frequency1(
  `count` string,
  `word` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ("input.regex" = "(.{7})(.{100})" ) LOCATION 's3://hunspell/frequency/'

Update:

This test table works as expected. It extracts the first, second, and fifth columns.

CREATE EXTERNAL TABLE hunspell.citiesr1 (id int, city_org string, ppl float) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION 's3://hunspell/myserde/';

The data looks like this:

1   東京  Tokyo   Japan   33.8
2   大阪  Osaka   Japan   16.7
11  北京  Beijing China   13.2
12  廣州  Guangzhou   China   15.3
21  Αθηνα   Athens  Greece  3.7
31  Якутск  Yakutsk Russia  0.6
110 La Coruña   Corunna Spain   0.37
112 Cádiz   Cadiz   Spain   0.4
120 Köln    Cologne Germany 0.97
121 München Munich  Germany 1.2
130 Tårnby  Tarnby  Danmark 0.04
140 Tønsberg    Tonsberg    Norway  0.05
150 Besançon    Bisanz  France  0.12

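The sample data posted above uses tabs as delimiters. My file is not delimited by tab. Let's assume the first 4 characters are the frequency and the next 10 are the ID, which may or may not be followed by a name of 100 characters.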

  1 1050174
  1 1050175
  1 1050177

In other words, how do I import fixed-width data into Athena using the regex SerDe?

Update 2:

Thanks to the answer, I was able to import the data using this:

CREATE EXTERNAL TABLE `frequency`(
  `count` string, 
  `word` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^(.{7}) (.+)$')
LOCATION 's3://XunspellX/stack/';

Is it possible to make the first column, "count", an integer? If I simply change the first column's type to integer, Athena does not import anything.

Answer 1

You got the error message in the first part of your question because the SerDe you used is not supported by Athena.

HIVE_SERDE_NOT_FOUND: deserializer does not exist: org.apache.hadoop.hive.contrib.serde2.RegexSerDe

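The Athena docs list the supported SerDes, so for regular expressions you would use org.apache.hadoop.hive.serde2.RegexSerDe, as you later found out.

For length-based column parsing, you should be able to adapt your regex pattern as needed. Below are a few examples based on the sample data and descriptions you provided.

Scenario 1

Two space-separated numeric columns, possibly with leading and/or trailing spaces.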

  1 1050174
  1 1050175
  1 1050177

Pattern: /^ *(\d+) (\d+) *$/gm

Scenario 2
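Since RegexSerDe follows Java regex rules, the Scenario 1 pattern can be tried locally before creating the table. The following is only a sketch: the class name `ScenarioOneCheck` is made up for illustration, the `/.../gm` delimiters and flags are dropped, and it assumes the SerDe requires the whole line to match, which is why `matches()` is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScenarioOneCheck {
    // Scenario 1 pattern as a Java string literal (backslashes doubled).
    static final Pattern ROW = Pattern.compile("^ *(\\d+) (\\d+) *$");

    // Returns the two captured columns, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] lines = { "  1 1050174", "  1 1050175", "  1 1050177" };
        for (String line : lines) {
            String[] cols = parse(line);
            System.out.println(cols == null ? "no match: " + line : cols[0] + " | " + cols[1]);
        }
    }
}
```

Because the leading spaces sit outside the first capture group, the captured value is the bare digits.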

The first 7 place holders are for the first column followed by a space and then the second column starting from 9th position

0000001 105017434786
0000002 105013
0000003 1050177438

Pattern: /^(\d{7}) (\d+)$/gm

Scenario 3
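As a rough local check of Scenario 2, the sketch below parses the captured groups as numbers. The class name `ScenarioTwoCheck` is hypothetical. Note that a value such as 105017434786 from the sample exceeds the 32-bit integer range, so a 64-bit type (bigint in the table definition) would presumably be needed for the second column.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScenarioTwoCheck {
    // Scenario 2 pattern: exactly 7 digits, one space, then one or more digits.
    static final Pattern ROW = Pattern.compile("^(\\d{7}) (\\d+)$");

    // Returns {count, id} as longs, or null when the line does not match.
    static long[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Zero-padded digits parse cleanly; the id is read as a long
        // because sample values do not fit in an int.
        return new long[] { Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] row = parse("0000001 105017434786");
        System.out.println(row[0] + " | " + row[1]);
    }
}
```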

My file is not delimited by tab. Let's assume the first 4 characters are frequency, the next 10 are ID that may or may not be followed by name of 100 characters

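Assuming that in this case your data is not separated by spaces as in the other two examples, and that the third column is nullable with a maximum length of 100, you might have something like this: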

00201050174347some-nullable-third-column-value
01931050174348
19841050174349another-nullable-third-column-value

Pattern: /^(\d{4})(\d{10})(.{0,100})$/gm

Finally, putting it all together, you get a CREATE TABLE statement that, for the third scenario, looks like this:
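The Scenario 3 pattern can be checked the same way. This sketch (class name `ScenarioThreeCheck` is hypothetical) shows that in plain Java the optional third group is captured as an empty string when no name follows the ID. How Athena surfaces that empty capture is not verified here.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScenarioThreeCheck {
    // 4-digit frequency, 10-digit ID, then 0 to 100 characters of anything.
    static final Pattern ROW = Pattern.compile("^(\\d{4})(\\d{10})(.{0,100})$");

    // Returns the three captured columns, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "00201050174347some-nullable-third-column-value",
            "01931050174348",
            "19841050174349another-nullable-third-column-value"
        };
        for (String line : lines) {
            System.out.println(Arrays.toString(parse(line)));
        }
    }
}
```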

CREATE EXTERNAL TABLE sample_database.sample_data (frequency int, id bigint, description varchar(100)) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^(\\d{4})(\\d{10})(.{0,100})$')
LOCATION 's3://your-s3-bucket/';

Note the double backslashes in the regex pattern here. Per this doc:

Note: RegexSerDe follows the Java standard. Because the backslash is an escape character in the Java String class, you must use a double backslash to define a single backslash. For example, to define \w, you must use \\w in your regex.
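The escaping rule in that note can be seen in plain Java. In this sketch (class and constant names are hypothetical), `ESCAPED` is written with doubled backslashes and yields regex text with single ones. `STRIPPED` shows what the pattern would look like if the backslashes were lost before compilation, which is an assumption about the failure mode rather than something the thread confirms.

```java
import java.util.regex.Pattern;

final class BackslashDemo {
    // Java literal with doubled backslashes; the resulting text has single ones.
    static final String ESCAPED = "^(\\d{4})(\\d{10})(.{0,100})$";
    // Hypothetical: the same pattern with the backslashes lost, so "d" is a literal letter.
    static final String STRIPPED = "^(d{4})(d{10})(.{0,100})$";

    // True when the whole line matches the given regex text.
    static boolean matches(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    public static void main(String[] args) {
        String line = "01931050174348";
        System.out.println(ESCAPED);                 // prints ^(\d{4})(\d{10})(.{0,100})$
        System.out.println(matches(ESCAPED, line));  // prints true
        System.out.println(matches(STRIPPED, line)); // prints false
    }
}
```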

Answer 2

Following the suggestion from user fearerjon, the following query works as expected:

CREATE EXTERNAL TABLE `frequency`(
  `count` int,
  `word` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^ *(\\d+) (.+)$')
LOCATION 's3://XunspelX/stack/';
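One plausible reason the int column works with this pattern but not with the earlier fixed-width one is that `(.{7})` captures the leading padding along with the digits, and padded text does not parse as an integer in Java. The sketch below illustrates this under stated assumptions: the sample line is made up, the class name `IntColumnCheck` is hypothetical, and whether the SerDe converts values in exactly this way is not confirmed by the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IntColumnCheck {
    // Earlier pattern from the question: first group is 7 arbitrary characters.
    static final Pattern FIXED = Pattern.compile("^(.{7}) (.+)$");
    // Pattern from this answer: leading spaces stay outside the first group.
    static final Pattern TRIMMED = Pattern.compile("^ *(\\d+) (.+)$");

    // Returns the first captured group, or null when the line does not match.
    static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // True when Java can parse the text as an int without trimming.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String line = "     42 hello"; // hypothetical row: count right-aligned in 7 characters
        String padded = firstGroup(FIXED, line);
        String digits = firstGroup(TRIMMED, line);
        System.out.println("[" + padded + "] parses as int: " + parsesAsInt(padded));
        System.out.println("[" + digits + "] parses as int: " + parsesAsInt(digits));
    }
}
```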
