Copy Data From S3 to Redshift [Precision issue in numeric data]

Question

I am using the command below to copy data from a text file into a Redshift table:

copy redshift_table_name from 's3://gamma-audit-calculation-output-ngr-data-json/2021/05/10/08/kinesis-calculation-output-ngr-data-1-2021-05-10-09-48-24-82ecea90-ef50-4907-82d7-8b162ca2b841' credentials iam_role json 'auto';

The file at the specified path contains the following data:

{"totalgross":6113.47,"totalnetpay":3661.6,"calculationtime":"05/10/2021 02:48:24 AM PDT","dynamicngrlaunched":true,"employeeid":"881448","totalanytimepaywithdrawals":6.62,"totalimputedincome":12.1,"paycheckdate":"2021-04-30","calculationtimeepochmillis":"1620640104258","ngr":0.60,"totalanytimepayrepayments":0.0,"otherrepayments":0.0,"payenddate":"2021-04-30","employeeid_calculationtimeepochmillis":"881448_1620640104258"}

The schema of my Redshift table is:

create table table_name ( employeeid varchar(65535), ngr numeric(17, 2), totalgross numeric(17, 2), totalnetpay numeric(17, 2), earningamount numeric(17, 2), totalimputedincome numeric(17, 2), totalanytimepaywithdrawals numeric(17, 2), totalanytimepayrepayments numeric(17, 2), dynamicngrlaunched boolean, paycheckdate varchar(65535), payenddate varchar(65535), calculationtime varchar(65535), otherRepayments numeric(17, 2), calculationtimeepochmillis bigint, employeeid_calculationtimeepochmillis varchar(65535) ) DISTKEY (employeeid) SORTKEY (calculationtimeepochmillis);

The problem I am facing is that the ngr value changes to 0.59 instead of 0.60 when it is saved to the Redshift table. How is this possible?

Answer 1

create table table_name (
  employeeid varchar(65535),
  ngr numeric(17, 2),
  totalgross numeric(17, 2),
  totalnetpay numeric(17, 2),
  earningamount numeric(17, 2),
  totalimputedincome numeric(17, 2),
  totalanytimepaywithdrawals numeric(17, 2),
  totalanytimepayrepayments numeric(17, 2),
  dynamicngrlaunched boolean,
  paycheckdate varchar(65535),
  payenddate varchar(65535),
  calculationtime varchar(65535),
  otherRepayments numeric(17, 2),
  calculationtimeepochmillis bigint,
  employeeid_calculationtimeepochmillis varchar(65535)
)
DISTKEY (employeeid)
SORTKEY (calculationtimeepochmillis);

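Before anything else, I would advise you in the strongest possible terms not to use maximum-length varchar. Last I knew, when rows are brought into memory, they use an amount of memory equal to their maximum length as specified in the DDL. You have five varchar(65535) columns, so each row in your table uses about 320 KB of memory.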

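Keep in mind that available memory is divided among queues and slots, and then across slices, so you may not actually have much memory available. It can vary a lot, but it might be around 100 MB in total. If you perform a hash join, you need the smaller table in the join to fit in memory once hashed, otherwise performance degrades. A running query also needs memory for other work, so if you have, say, 100 MB, perhaps at most half of it is available for the hash. With 320 KB rows, 50 MB gives you a maximum of about 150 rows in your table. You can of course go straight past this limit. Redshift will not stop you and will not warn you in any way, but performance will drop and you will not know why.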
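The arithmetic behind that estimate can be sketched in a few lines of Java. This is only a back-of-the-envelope calculation under the answer's assumptions: each varchar is charged at its declared maximum length, the other columns are ignored, and the 50 MB budget is a guess rather than a measured figure. The class and method names are made up for illustration.

```java
public class RowMemoryEstimate {
    // Bytes per row if every varchar column is charged at its declared maximum length.
    static long rowBytes(int varcharColumns, int declaredLength) {
        return (long) varcharColumns * declaredLength;
    }

    // Number of whole rows that fit into a memory budget (integer division).
    static long rowsThatFit(long budgetBytes, long bytesPerRow) {
        return budgetBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long perRow = rowBytes(5, 65535);   // five varchar(65535) columns, about 320 KB
        long budget = 50L * 1024 * 1024;    // assumed 50 MB available for the hash
        System.out.println(perRow + " bytes per row");
        System.out.println(rowsThatFit(budget, perRow) + " rows fit in the budget");
    }
}
```

With binary megabytes the result lands slightly above the answer's round figure of 150 rows; either way the order of magnitude is what matters.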

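Also watch your numerics and do not go over a precision of 19. With a precision of 19 or less, a numeric is eight bytes, but with a precision of 20 or more it becomes sixteen bytes (regardless of the values you actually store) and has to be handled by a math library rather than directly by the processor hardware.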

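Also, remember to use NOT NULL wherever possible, because it reduces the size of the column. This matters particularly for boolean, which takes one bit per value when NOT NULL but two bits per value when nullable, and for varchar, where being nullable adds one byte to the data stored for each string.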

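Finally, you have not set any encodings. Redshift will choose them for you, but it does a poor job of choosing encodings. I would strongly advise you to choose your own.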

Now, on to your question.

Here the problem I am facing is that the ngr value while getting saved to Redshift table changes to 0.59 instead of 0.60. How can this be possible?

I could be wrong, and I would need to test to check, but my guess is that the number is first read as a float and then converted to numeric.

Integers (which is essentially what numeric is) and floating-point numbers behave differently.

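Integers are exact. Floats are not. What I mean is that when you store an integer, you always get back the number you stored. That is not the case with floats. If you picture the continuous range of numbers between the minimum and maximum float as a picket fence, the fence has a post every so often that goes into the ground, and only the posts can be stored. So when you store a number, it is converted to the nearest storable number; that is what is stored, and that is what you get back.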

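So when you store 0.60, there is no "post" at exactly 0.60. The nearest post sits very slightly below it (about 0.5999...), and if that value is then truncated to two decimal places rather than rounded, what ends up stored is 0.59, which is what you get when you read the number back.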
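The effect is easy to reproduce outside Redshift. The sketch below uses Java's BigDecimal to print the exact value of the double nearest to 0.60 and then compares truncating it with rounding it. It shows how IEEE 754 doubles behave in Java; whether COPY follows the same path internally is the guess made above, not something this code proves. The class and method names are illustrative only.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FloatTruncationDemo {
    // Exact decimal expansion of the double that is nearest to the given value.
    static BigDecimal exactValue(double d) {
        return new BigDecimal(d);
    }

    // Drop every digit past the given scale (round toward zero).
    static BigDecimal truncate(double d, int scale) {
        return new BigDecimal(d).setScale(scale, RoundingMode.DOWN);
    }

    // Round half up to the given scale.
    static BigDecimal round(double d, int scale) {
        return new BigDecimal(d).setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println("exact:     " + exactValue(0.60));
        System.out.println("truncated: " + truncate(0.60, 2)); // 0.59
        System.out.println("rounded:   " + round(0.60, 2));    // 0.60
    }
}
```

The stored double is a hair below 0.60, so truncation to two places yields 0.59 while rounding recovers 0.60.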

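If you want the numbers to be exact, multiply them by a power of 10 so that the fractional part is always zero, and store them as integers. So for 0.59, assuming every number has two digits in its fractional part, multiply by 100 so that 0.59 becomes 59, and store 59 as an integer. Do all the math with integers, and only convert back to floating point at the very last stage.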
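A minimal sketch of the scaled-integer idea, assuming the values arrive as decimal text with at most two fractional digits. The helper names toScaled and fromScaled are hypothetical. The text is parsed directly into a BigDecimal so that no double is involved at any point.

```java
import java.math.BigDecimal;

public class ScaledIntegerDemo {
    // Convert decimal text such as "0.60" into a whole number of hundredths.
    // Throws ArithmeticException if the text has more fractional digits than `scale`.
    static long toScaled(String decimalText, int scale) {
        return new BigDecimal(decimalText).movePointRight(scale).longValueExact();
    }

    // Turn a scaled integer back into decimal text for display.
    static String fromScaled(long scaled, int scale) {
        return BigDecimal.valueOf(scaled, scale).toPlainString();
    }

    public static void main(String[] args) {
        long ngr = toScaled("0.60", 2);       // 60
        long gross = toScaled("6113.47", 2);  // 611347
        long sum = ngr + gross;               // exact integer arithmetic
        System.out.println(fromScaled(sum, 2)); // 6114.07
    }
}
```

In this sketch the final conversion goes back to decimal text rather than to a float, which keeps the displayed value exact as well.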

David Goldberg wrote a well-known paper, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", which explains this issue:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Answer 2

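I think user15782476's reply describes the root cause of the problem perfectly. It helped me a lot in understanding what was going on when I was struggling with the same issue. I have, however, found a workaround that may be helpful. To get around Redshift truncating decimals instead of rounding them, you need to nudge the value slightly. For example, if you want 0.60 to be persisted, write 0.601 to the data file. This yields a float value of 0.600999... instead of 0.599999..., which is then correctly truncated to 0.60. Here is a Java method that uses reflection to adjust all BigDecimal field values and works out the correction (for example, for 0.60 with scale = 2, it adds 0.001):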

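import java.lang.reflect.Field;
import java.math.BigDecimal;

// `log` is assumed to be an SLF4J-style logger declared on the enclosing class.
private <T> T adjustNumericFields(T entity) {
  for (Field f : entity.getClass().getDeclaredFields()) {
    if (f.getType().equals(BigDecimal.class)) {
      try {
        f.setAccessible(true);
        BigDecimal bigDecimal = (BigDecimal) f.get(entity);
        // Only positive values are nudged; null, zero and negative values are left as they are.
        if (bigDecimal != null && bigDecimal.signum() > 0) {
          int scale = bigDecimal.scale();
          // One unit in the digit just past the value's own scale, e.g. 0.001 for scale 2.
          BigDecimal correction = BigDecimal.ONE.movePointLeft(scale + 1);
          f.set(entity, bigDecimal.add(correction));
        }
      } catch (Exception e) {
        log.warn("Could not adjust value: {}/{} - {}", entity.getClass().getSimpleName(), f.getName(), e.getMessage());
      }
    }
  }
  return entity;
}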
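To see what the correction does to a single value, here is a small standalone sketch. The asLoaded method is only a stand-in for the load path and rests on the assumption discussed in this thread (the text is parsed as a double and then truncated to the column scale); it is not Redshift's actual implementation, and the class and method names are invented for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class NudgeDemo {
    // Add one unit in the digit just past the value's own scale (positive values only).
    static BigDecimal nudge(BigDecimal value) {
        if (value.signum() <= 0) {
            return value;
        }
        return value.add(BigDecimal.ONE.movePointLeft(value.scale() + 1));
    }

    // Assumed load behaviour: go through a double, then truncate to the column scale.
    static BigDecimal asLoaded(BigDecimal written, int columnScale) {
        return new BigDecimal(written.doubleValue()).setScale(columnScale, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("0.60");
        BigDecimal nudged = nudge(original);                       // 0.601
        System.out.println("plain:  " + asLoaded(original, 2));    // 0.59
        System.out.println("nudged: " + asLoaded(nudged, 2));      // 0.60
    }
}
```

The nudge changes the value written to the data file, so it only makes sense when the target column has the same scale as the original value and the extra digit is guaranteed to be truncated away.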

amazon-s3

amazon-web-services

amazon-redshift