HiveQL query returns no results and no errors
I am running Apache Hadoop 2.6.0 on Ubuntu 14.0, and in Hive 0.13.0 I created a table as follows:
CREATE TABLE IF NOT EXISTS recipes_hive.cuisine (
ID INT COMMENT 'Cuisine ID.',
name STRING COMMENT 'Cuisine name - primary key.',
area STRING COMMENT 'Name of the area of origin - foreign key.',
scope STRING COMMENT 'Either country or area.')
COMMENT 'Table containing cuisines data.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
I populated it with the statement:
LOAD DATA LOCAL INPATH 'path_to_file/CUISINE.csv'
OVERWRITE INTO TABLE recipes_hive.cuisine;
My database has several such tables, all created and populated the same way. When I run a simple query such as:
SELECT * FROM cuisine
even with some conditions in a WHERE clause, I get the expected results, but when I run more complex queries I get nothing back. For example:
SELECT cuisine.name, SUM(IF (ingredient.category = "fruit",1,2))/count(*) AS PERC
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
GROUP BY cuisine.name
ORDER BY PERC DESC
or:
SELECT ingredient.id, ingredient.name
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE ingredient.id IN (
SELECT ingredient.id
FROM cuisine c JOIN recipe ON recipe.cuisine = c.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE c.name = "Pakistan") AND cuisine.name = "Bangladesh"
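The first example computes a percentage per cuisine; the second looks for ingredients that two cuisines have in common.
MapReduce and Hadoop are invoked correctly and return no errors. The output ends with: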
Execution completed successfully
MapredLocal task succeeded
OK
Time taken: 122.119 seconds
I searched the web and found people with similar problems. I checked:
Hive Table returning empty result set on all queries
Simple Hive query is empty
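but neither solved my problem. The data really is in HDFS and, as mentioned, simple queries work.
So either something is wrong with my Hive instance, or my queries are not written correctly.
Any help would be appreciated.
Best regards.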
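Are you sure the result of the joins is non-empty? Because you are using inner joins, if even one table has no matching records, the whole result set is empty. Try a left join with an "IS NULL" check to verify what each table contributes to the result set. If every joined table has non-null values in its join columns after the join, the query is fine.
For example, if the cuisine table contains ID = {1,2,3} and the recipe table contains ID = {5,6,7}, then even though both tables are non-empty, an INNER JOIN on Cuisine.ID = Recipe.ID returns no rows (because the IDs in the two tables differ).
Can you check that this is not what is happening here?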
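The left join with "IS NULL" suggested above could look roughly like the following. This is only a sketch: the table and column names are taken from the queries in the question, the aliases and result column names are made up for illustration, and it assumes the session is already using the recipes_hive database.

```sql
-- Cuisines that have no matching recipe.
SELECT count(1) AS cuisines_without_recipe
FROM cuisine c
LEFT OUTER JOIN recipe r ON r.cuisine = c.name
WHERE r.cuisine IS NULL;

-- Recipes that have no matching part_of row.
SELECT count(1) AS recipes_without_parts
FROM recipe r
LEFT OUTER JOIN part_of p ON p.id_recipe = r.id
WHERE p.id_recipe IS NULL;

-- part_of rows whose ingredient is not in the ingredient table.
SELECT count(1) AS parts_without_ingredient
FROM part_of p
LEFT OUTER JOIN ingredient i ON i.name = p.ingredient
WHERE i.name IS NULL;
```

If one of these counts equals the total number of rows in the left-hand table, nothing matches on that join, and any inner join chain that goes through it will come back empty.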
SELECT count(1)
FROM cuisine c JOIN recipe ON recipe.cuisine = c.name WHERE c.name = "Pakistan";
--- must return > 0
select count(1) from recipe as recipe
JOIN part_of ON part_of.id_recipe = recipe.id ;
--- must return > 0
select count(1) from part_of as part_of
JOIN ingredient ON ingredient.name = part_of.ingredient ;
--- must return > 0
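If one of these counts comes back as 0 even though both tables contain data, it may be worth looking at the join keys themselves. One possible cause (an assumption, nothing in the question confirms it) is that the values loaded from the CSV files differ by stray whitespace, quotes, or letter case, so that they never compare equal. A rough way to inspect this, using the column names from the question:

```sql
-- Look at the raw key values and their lengths; an unexpected length
-- hints at trailing spaces or quote characters kept from the CSV.
SELECT DISTINCT name, length(name) AS name_len FROM cuisine LIMIT 20;
SELECT DISTINCT cuisine, length(cuisine) AS cuisine_len FROM recipe LIMIT 20;

-- Retry the join on normalised keys (diagnostic only).
SELECT count(1)
FROM cuisine c
JOIN recipe r ON lower(trim(r.cuisine)) = lower(trim(c.name));
```

If the normalised join returns rows where the plain one does not, the data files need cleaning rather than the queries.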
So the inner query can only return rows if all of the counts above are non-zero. Now test the outer select:
SELECT ingredient.id, ingredient.name
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE ingredient.id = <inner query result> and cuisine.name = "Bangladesh";