Apache Drill cannot parse CSV files with Windows EOL correctly?

OK, let's save someone 8 hours of clueless debugging.

TL;DR: Apache Drill cannot correctly parse CSV files generated on Windows machines. That's because their EOL is \r\n by default, unlike on Unix systems, where it is \n. This leads to horribly undebuggable errors, because the \r probably stays glued to the end of the last field's value. And what's funny, you won't notice it, because it's invisible.

Let's take two files, one created on Linux and the other on Windows: hello.linux.csv and hello.win.csv. The content is the same (at least it looks that way...):

field_a,field_b
Hello,0.5

Let's query them.

SELECT * from (...)/hello.linux.csv;
---
field_a, field_b
Hello, "0.5"

SELECT * from (...)/hello.win.csv;
---
field_a, field_b
Hello, "0.5"

Great! Let's do something with the data. Converting "0.5" to a number should be no problem (and it is necessary).

SELECT 
   field_a, CAST (field_b as DECIMAL(10, 2)) as test 
from (...)/hello.linux.csv;
---
field_a, test
Hello, 0.5


-- ... aaand, here we go!
SELECT 
   field_a, CAST (field_b as DECIMAL(10, 2)) as test 
from (...)/hello.win.csv;

[30038]Query execution error. Details:[
SYSTEM ERROR: NumberFormatException
Fragment 0:0
Please, refer to logs for more information.  -- In the logs, there is only useless java stacktrace, of course.
[Error Id: 3551c939-3f5b-42c1-9b58-d600da5f12a0 on drill-develop-7bdb45c597-52rnz:31010]
]
...
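
One way to confirm the suspicion, and to work around it inside a single query, is sketched below. It assumes the stray \r really is attached to the value of field_b; CHAR_LENGTH and REGEXP_REPLACE are standard Drill string functions, and the (...) path is left elided exactly as in the queries above.

```sql
-- Diagnostic sketch: the text looks like 0.5 (3 characters),
-- so a reported length of 4 would point to a hidden trailing \r.
SELECT
   field_b, CHAR_LENGTH(field_b) as len
from (...)/hello.win.csv;

-- Workaround sketch: strip a trailing carriage return before casting.
-- The pattern \r$ is evaluated as a Java regular expression.
SELECT
   field_a,
   CAST (REGEXP_REPLACE(field_b, '\r$', '') as DECIMAL(10, 2)) as test
from (...)/hello.win.csv;
```

This only patches the one column being cast; it does not change how Drill splits the file into lines.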

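(Now imagine how much time it takes to uncover this in a complex production setup, where the queries, the data, and everything else are a good deal more complicated.)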

Question: Is there a way to force Apache Drill (v 1.15) to handle CSV files created with Windows EOL?

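You can update the line delimiter of the csv format to \r\n, but this will apply to all csv files within the scope of the text plugin. To change the delimiter per table, use a table function.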

https://drill.apache.org/docs/plugin-configuration-basics/
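
A minimal sketch of the per-table approach, applied to the failing query from the question. The attribute names (type, fieldDelimiter, lineDelimiter, extractHeader) are the text format attributes described on the page linked above. The storage plugin name dfs is an assumption, as is the expectation that the table function turns the literal '\r\n' into CR LF; the (...) path is left elided as in the question.

```sql
-- Assumption: the file is reachable through a storage plugin named dfs.
-- Assumption: '\r\n' is unescaped to CR LF by the table function.
SELECT
   field_a, CAST (field_b as DECIMAL(10, 2)) as test
from table(
   dfs.`(...)/hello.win.csv`(
      type => 'text',
      fieldDelimiter => ',',
      lineDelimiter => '\r\n',
      extractHeader => true
   )
);
```

Because the options are passed in the query itself, the plugin-wide csv configuration stays untouched and files with Unix line endings keep working as before.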