Ingesting Parquet file gives UTF-8 error [Druid 0.12.0]
I have a Parquet file generated by AWS Glue. I have installed the Parquet and Avro extensions (tried with both 0.12.0 and 0.12.1), and in each case I get the following error:
$ >curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/master.parquet localhost:8090/druid/indexer/v1/task
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /druid/indexer/v1/task. Reason:
<pre> javax.servlet.ServletException: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x27
at [Source: HttpInputOverHTTP@149d71fc[c=8000,q=1,[0]=Content@519fed0b{HeapByteBufferR@67183cce[p=8000,l=8192,c=8192,r=192]={PAR1\x15\x04\x15\xC0\x81\x01\x15\xF4'L\x15\xA0\t...X\xA2\xC7\x1c\xB7\xCc\x81\xC9\x1c\x984\x82I#s<<<42\xC7\x1dt<B\xC7\x1cs\xC0\xE3H\x1fx\xCc\x81...\xE2\x08$\xAa`R\x87#\xB0`RI\x1d\x90\xD4>>>}},s=STREAM]; line: 1, column: 14]</pre></p>
<hr /><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.3.19.v20170502</a><hr/>
</body>
</html>
== JSON config file ==
$ >more quickstart/master.json
{
"type" : "index_hadoop",
"spec" : {
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "static",
"inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
"paths" : "quickstart/master.parquet"
}
},
"dataSchema" : {
"dataSource" : "master",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2010-03-01/2020-05-28"]
},
"parser" : {
"type" : "parquet",
"parseSpec" : {
"format" : "timeAndDims",
"dimensionsSpec" : {
"dimensions" : [
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "ndate"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "collection_USD_SUM",
"type" : "longSum",
"fieldName" : "collection_USD"
},
{
"name" : "order_count",
"type" : "hyperUnique",
"fieldName" : "orderNumber"
},
{
"name" : "lead_count",
"type" : "count",
"fieldName" : "Sales.leads"
}
]
},
"tuningConfig" : {
"type" : "hadoop",
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 5000000
},
"jobProperties" : {}
}
}
}
Any clues?
1. You should modify this command (change master.parquet to master.json): $ >curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/master.json localhost:8090/druid/indexer/v1/task
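To see why the original command fails: curl was sending the binary Parquet file itself as the request body, and Druid's Jackson parser then tried to read those bytes as a JSON task spec, hitting invalid UTF-8 (note the `PAR1` Parquet magic bytes in the error dump). A small local sketch (hypothetical helper, not Druid code) reproduces the distinction:

```python
import json

def looks_like_json(payload: bytes) -> bool:
    """Return True if payload decodes as UTF-8 and parses as JSON."""
    try:
        json.loads(payload.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False

# Parquet files begin with the magic bytes "PAR1" followed by binary data,
# so a JSON parser rejects them, just as Jackson did in the 500 response.
print(looks_like_json(b"PAR1\x15\x04\x15\xc0\x81\x01"))   # False
print(looks_like_json(b'{"type": "index_hadoop"}'))       # True
```

The JSON task spec goes in the request body; the Parquet file is referenced only via `"paths"` inside `ioConfig`.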
2. In the JSON config file, "paths" should point to the data path.
Check your S3 file's data format and check the Hadoop jars.
Caused by: java.lang.IllegalArgumentException: Cannot construct instance of java.lang.Class, problem: io.druid.data.input.parquet.DruidParquetInputFormat
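For what it's worth, a "Cannot construct instance" error naming io.druid.data.input.parquet.DruidParquetInputFormat usually means the Parquet extension is not actually on the classpath of the node running the task. Assuming a standard Druid 0.12.x layout, loading both extensions in conf/druid/_common/common.runtime.properties would look roughly like this (exact directory names may differ in your install):

```properties
# Both extensions must be present under the extensions directory
# (druid-parquet-extensions depends on druid-avro-extensions).
druid.extensions.loadList=["druid-avro-extensions", "druid-parquet-extensions"]
```

Restart the overlord/middleManager after changing the load list so the extension jars are picked up.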