Apache PIG - ERROR org.apache.pig.impl.PigContext - Encountered " <OTHER> ",= "" at line 1, column 1
I am trying to use Apache PIG to do some data cleaning on data that comes from a Hive table.
I have these statements in my Apache PIG script:
INPUT_FILE = LOAD 'staging_area' USING org.apache.hive.hcatalog.pig.HCatLoader()
AS
(ID:Long,
CHAIN:Int,
DEPT:Int,
CATEGORY:Int,
COMPANY:Long,
BRAND:Long,
DATE:Chararray,
QUARTER:Int,
MONTH:Int,
DAY:Int,
WEEKDAY:Int,
PRODUCT_SIZE:Int,
PRODUCT_MEASURE:Chararray,
PRODUCT_QUANTITY:Int,
PURCHASE_AMOUNT:Double);
SPLIT INPUT_FILE INTO DATA IF (PRODUCT_SIZE > 0 AND PURCHASE_AMOUNT > 0 AND PRODUCT_QUANTITY > 0), MISSING_VALUES if (PRODUCT_QUANTITY <= 0 OR PURCHASE_AMOUNT <= 0);
DATA_TRANSFORMATION = FOREACH DATA GENERATE
ID,
CHAIN,
DEPT,
CATEGORY,
ToDate(DATE,'yyyy-MM-dd') as DATE_ID,
QUARTER,
MONTH,
DAY,
WEEKDAY,
PRODUCT_SIZE,
PURCHASE_AMOUNT;
GRP = GROUP DATA_TRANSFORMATION BY ID;
SUMMED = foreach GRP {
amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
cnt = COUNT(DATA_TRANSFORMATION.ID);
generate group, Purchase_Average,Freq_Visits;
}
JOINED = join DATA_TRANSFORMATION by $0, SUMMED by $0;
DATASET = FOREACH JOINED GENERATE $0,,,,,,,,,,,,;
RANKING = rank DATASET by ,,$0;
DW = FOREACH RANKING GENERATE as ID, as Purchase_Average, as Freq_Visits, $0 as Transaction_ID, ,,,,,,,,,;
STORE DW INTO '/user/cloudera/data' USING PigStorage(',');
The table in Hive contains this data (first 10 rows):
id chain dept category company brand date_id quarter month_id day_id weekday productsize productmeasure purchasequantity purchaseamount
1940424003 46 99 9909 1081843181 25935 29-01-2013 00:00 1 1 29 2 6 OZ 2 5
1940424003 46 35 3504 103500030 13470 04-02-2013 00:00 1 2 4 1 25 OZ 2 5
1940424003 46 91 9115 108048080 1230 08-02-2013 00:00 1 2 8 5 0 LT 1 13.99
1940452798 46 7 706 101200010 17286 09-02-2013 00:00 1 2 9 6 38 OZ 1 5.75
1940452798 46 45 4517 107220575 17340 10-02-2013 00:00 1 2 10 7 16 OZ 1 45
1940452798 46 99 9909 107143070 5072 10-02-2013 00:00 1 2 10 7 12 OZ 1 1.99
1940452798 46 21 2119 1061300868 867 10-02-2013 00:00 1 2 10 7 138 OZ 1 43.8
1940452798 46 56 5616 1071373373 11473 10-02-2013 00:00 1 2 10 7 8 OZ 1 2.5
1940452798 46 7 706 107146474 2142 10-02-2013 00:00 1 2 10 7 15 OZ 1 2
1940452798 46 72 7205 103700030 4294 22-02-2013 00:00 1 2 22 5 6 OZ 1 3
Every time I run my script I get this error:
ERROR org.apache.pig.impl.PigContext - Encountered " <OTHER> ",= "" at line 1, column 1
Does anyone know how to solve this? My data has 3,000,000 records and I am using Cloudera Quickstart VM 5.8.
SUMMED = foreach GRP {
amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
cnt = COUNT(DATA_TRANSFORMATION.ID);
generate group, Purchase_Average,Freq_Visits;
}
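You cannot project Purchase_Average and Freq_Visits here. The nested block only defines the aliases amount and cnt, so those two names do not exist when GENERATE runs.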
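One way the block could be rewritten is sketched below. It assumes Purchase_Average is meant to be the total purchase amount divided by the number of transactions for each ID, and that Freq_Visits is that number of transactions. The question does not define either name, so both meanings are guesses.

```pig
SUMMED = foreach GRP {
    -- Aliases computed inside the nested block.
    amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
    cnt = COUNT(DATA_TRANSFORMATION.ID);
    -- Assumption: average = total amount / number of rows for this ID,
    -- and visit frequency = number of rows for this ID.
    generate group as ID, amount / cnt as Purchase_Average, cnt as Freq_Visits;
}
```

Only names that exist at that point (group, amount, cnt, or the grouped bag DATA_TRANSFORMATION) can appear after generate. The new field names are attached with "as".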