Pivotal HDB complains: "data line too long. likely due to invalid csv data"

We have a small Pivotal Hadoop-HAWQ cluster. We have created an external table on it that points to files in Hadoop.

Environment:

Product version: (HAWQ 1.3.0.2 build 14421) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2

Tried:

When we try to read from the external table with the following command:

test=# select count(*) from EXT_TAB ;

we get the following error:

ERROR: data line too long. likely due to invalid csv data (seg0 slice1 SEG0.HOSTNAME.COM:40000 pid=447247)
DETAIL: External table trcd_stg0, line 12059 of pxf://hostname/tmp/def_rcd/?profile=HdfsTextSimple: "2012-08-06 00:00:00.0^2012-08-06 00:00:00.0^6552^2016-01-09 03:15:43.427^0005567^COMPLAINTS ..."

Additional information:

The DDL for the external table is:

CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0
(
    "DispDt" DATE,
    "InvoiceDt" DATE,
    "ID" INTEGER,
    time timestamp without time zone,
    "Customer" CHAR(7),
    "CustomerName" CHARACTER VARYING(30),
    "MasterAccount" CHAR(7),
    "MasterAccName" CHAR(30),
    "SalesOrder" CHAR(6),
    "SalesOrderLine" NUMERIC(4, 0),
    "OrderStatus" CHAR(200),
    "MStockCode" CHAR(30),
    "MStockDes" CHARACTER VARYING(500),
    "MWarehouse" CHAR(200),
    "MOrderQty" NUMERIC(10, 3),
    "MShipQty" NUMERIC(10, 3),
    "MBackOrderQty" NUMERIC(10, 3),
    "MUnitCost" NUMERIC(15, 5),
    "MPrice" NUMERIC(15, 5),
    "MProductClass" CHAR(200),
    "Salesperson" CHAR(200),
    "CustomerPoNumber" CHAR(30),
    "OrderDate" DATE,
    "ReqShipDate" DATE,
    "DispatchesMade" CHAR(1),
    "NumDispatches" NUMERIC(4, 0),
    "OrderValue" NUMERIC(26, 8),
    "BOValue" NUMERIC(26, 8),
    "OrdQtyInEaches" NUMERIC(21, 9),
    "BOQtyInEaches" NUMERIC(21, 9),
    "DispQty" NUMERIC(38, 3),
    "DispQtyInEaches" NUMERIC(38, 9),
    "CustomerClass" CHAR(200),
    "MLineShipDate" DATE
)
LOCATION (
    'pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple'
)
FORMAT 'CSV' (delimiter '^' null '' escape '"' quote '"')
ENCODING 'UTF8';

Any help would be much appreciated.

Based on the source code: https://github.com/apache/incubator-hawq/blob/e48a07b0d8a5c8d41d2d4aaaa70254867b11ee11/src/backend/commands/copy.c

The error is raised when cstate->line_buf.len >= gp_max_csv_line_length is true. See: http://hawq.docs.pivotal.io/docs-hawq/guc_config-gp_max_csv_line_length.html

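The default maximum CSV line length is 1048576 bytes. Have you checked the line lengths in your CSV file and tried increasing the value of this setting?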
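To check the line lengths before changing the setting, something like the following sketch could help. It reports every raw line whose byte length reaches a given limit. The function name `longest_lines` is made up for illustration, and it measures physical lines only, so it does not reproduce how HAWQ's CSV parser joins lines inside a quoted field.

```python
import io

# Default value of gp_max_csv_line_length quoted above, in bytes.
DEFAULT_MAX_CSV_LINE_LENGTH = 1048576


def longest_lines(stream, limit=DEFAULT_MAX_CSV_LINE_LENGTH):
    """Return (line_number, byte_length) for each raw line at or over `limit`.

    `stream` must be a binary file-like object; lengths exclude the newline.
    """
    offenders = []
    for number, raw in enumerate(stream, start=1):
        length = len(raw.rstrip(b"\r\n"))
        if length >= limit:
            offenders.append((number, length))
    return offenders


# Tiny in-memory demo with an artificially small limit.
demo = io.BytesIO(b"short\n" + b"x" * 20 + b"\n")
print(longest_lines(demo, limit=10))  # -> [(2, 20)]
```

If no physical line comes anywhere near the limit, the parser is probably merging several lines into one logical row, and raising the setting would only postpone the error.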

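Check whether the number of delimited fields on line 12059 matches the number of columns (the DDL above defines 34). If some lines get combined during parsing, the combined line can exceed the maximum line length. This usually happens because of bad data. To count the fields in a line:

echo "$LINE" | awk -F '^' '{ total += NF } END { print total }'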
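The same check can be run over the whole file rather than one line. The sketch below flags lines whose field count differs from the 34 columns in the DDL, and lines with an odd number of quote characters. The second check rests on an assumption about the cause: in CSV format a newline inside a quoted field belongs to that field, so a stray unmatched `"` can make the parser keep reading into the following lines. The function name `suspicious_lines` is hypothetical, and the count is naive, since it treats a `^` inside a quoted field as a separator.

```python
import io

EXPECTED_FIELDS = 34  # number of columns in the external table DDL
DELIMITER = "^"       # delimiter from the FORMAT clause
QUOTE = '"'           # quote (and escape) character from the FORMAT clause


def suspicious_lines(stream, expected=EXPECTED_FIELDS,
                     delimiter=DELIMITER, quote=QUOTE):
    """Return (line_number, field_count, quote_count) for suspect lines.

    A line is suspect if its naive field count differs from `expected`
    or if it contains an odd number of quote characters.
    """
    findings = []
    for number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        fields = line.count(delimiter) + 1
        quotes = line.count(quote)
        if fields != expected or quotes % 2 == 1:
            findings.append((number, fields, quotes))
    return findings


# Tiny in-memory demo using 3 expected fields instead of 34.
sample = io.StringIO('a^b^c\na^"b^c\na^b\n')
print(suspicious_lines(sample, expected=3))  # -> [(2, 3, 1), (3, 2, 0)]
```

The first flagged line at or before line 12059 is the most likely place where the parser starts merging rows.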