Get all records from the nth bucket in Hive SQL
How can I get all records from the nth bucket in Hive? Something like:
Select * from bucket 9 of bucketTable;
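For reference, a bucketed table is one created with a CLUSTERED BY ... INTO n BUCKETS clause, so each bucket lands in its own file (or set of files) under the table directory. A minimal sketch, with hypothetical table and column names:
-- hypothetical bucketed table with 10 buckets, stored as ORC
CREATE TABLE bucketTable (
  user_id INT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 10 BUCKETS
STORED AS ORC;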
You can achieve this in different ways:
Approach-1: Get the table's stored location from desc formatted <db>.<tab_name>, then read the 9th bucket file directly from the HDFS filesystem. Bucket files are numbered from zero (000000_0, 000001_0, ...), so the 9th bucket corresponds to the file(s) named 000008_0*.
(or)
Approach-2: Use input_file_name(), then keep only the 9th bucket's data by filtering on the filename.
Example:
Approach-1:
Scala:
val df = spark.sql("desc formatted <db>.<tab_name>")
//get table location in hdfs path
val loc_hdfs = df.filter('col_name === "Location").select("data_type").collect.map(x => x(0)).mkString
//based on your table format change the read format
val ninth_buk = spark.read.orc(s"${loc_hdfs}/000008_0*")
//display the data
ninth_buk.show()
Pyspark:
from pyspark.sql.functions import *
df = spark.sql("desc formatted <db>.<tab_name>")
#get table location in hdfs path
loc_hdfs = df.filter(col("col_name") == "Location").select("data_type").collect()[0]["data_type"]
#based on your table format change the read format
ninth_buk = spark.read.orc(loc_hdfs + "/000008_0*")
#display the data
ninth_buk.show()
Approach-2:
Scala:
val df = spark.read.table("<db>.<tab_name>")
//add input_file_name
val df1 = df.withColumn("filename",input_file_name())
//filter only the 9th bucket filename and select only required columns
val ninth_buk = df1.filter('filename.contains("000008_0")).select(df.columns.head,df.columns.tail:_*)
ninth_buk.show()
Pyspark:
from pyspark.sql.functions import *
df = spark.read.table("<db>.<tab_name>")
#add input_file_name
df1 = df.withColumn("filename",input_file_name())
#filter only the 9th bucket filename and select only the original columns
ninth_buk = df1.filter(col("filename").contains("000008_0")).select(*df.columns)
ninth_buk.show()
Approach-2 is not recommended if you have a huge amount of data, because the whole dataframe has to be filtered..!!
In Hive:
set hive.support.quoted.identifiers=none;
-- the regex column spec `(fn)?+.+` selects every column except the helper column fn
select `(fn)?+.+` from (
select *,input__file__name fn from table_name)e
where e.fn like '%000008_0%';
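The same filter can be written in Spark SQL using the input_file_name() function instead of Hive's input__file__name virtual column (a sketch; the helper column fn stays in the output unless you drop it):
select * from (
select *, input_file_name() as fn from <db>.<tab_name>
) e
where e.fn like '%000008_0%';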
If it is an ORC table, Spark SQL can also read the bucket file path directly:
SELECT * FROM orc.`<bucket_HDFS_path>`
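For example, with the table location found in Approach-1 (the warehouse path below is purely hypothetical), the 9th bucket file can be queried directly:
-- hypothetical path; substitute your table's Location and bucket file name
SELECT * FROM orc.`/user/hive/warehouse/mydb.db/bucket_table/000008_0`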
select * from bucketing_table tablesample(bucket n out of y on clustered_criteria_column);
where bucketing_table is your bucketed table name
n => nth bucket
y => total no. of buckets
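For example, to read the 9th bucket of a table that was bucketed into 10 buckets on user_id (the table and column names here are hypothetical), keep y equal to the table's actual bucket count so the sample is exactly one bucket:
select * from bucketTable tablesample(bucket 9 out of 10 on user_id);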