Save data as a text file from Spark to HDFS
I am using pySpark and sqlContext, processing the data with the following query:
(sqlContext.sql("select LastUpdate,Count(1) as Count" from temp_t)
.rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))
The data gets stored in the following format:
Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
But I would like to store the data in a Hive table as:
LastUpdate Count
2016-03-14 12:27:55.01 1
. .
. .
Here is how I created the table in Hive:
CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';
I have tried many options but have not been successful. Please help me with this.
You have created a table, and now you need to populate it with the data you generated. I believe this could be run from a Spark HiveContext:
LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count
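For example, a minimal sketch of issuing that statement from pySpark, assuming a HiveContext has already been created (e.g. as hiveCtx = HiveContext(sc), as in the answer below):
# Sketch under that assumption: run the LOAD DATA statement through the HiveContext.
hiveCtx.sql("LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count")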
Alternatively, you may want to build a table on top of the data:
CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
LastUpdate DATE,
Count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';
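Note that both options assume the files under /apps/hive/warehouse/Count actually contain pipe-delimited text. saveAsTextFile writes the string representation of each element, which is why the output above shows Row(...). A hedged sketch of writing the rows in '|'-delimited form first (assuming the aggregation groups by LastUpdate, as the sample output suggests) might look like:
# Sketch under assumptions: format each Row as a '|'-delimited line so it matches
# the table definition's FIELDS TERMINATED BY '|'.
(sqlContext.sql("select LastUpdate, Count(1) as Count from temp_t group by LastUpdate")
 .rdd
 .map(lambda row: u"{0}|{1}".format(row.LastUpdate, row.Count))
 .coalesce(1)
 .saveAsTextFile("/apps/hive/warehouse/Count"))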
Why not load the data into Hive itself, instead of going through the process of saving a file and then loading it into Hive?
from datetime import datetime, timedelta
from pyspark.sql import HiveContext, Row
hiveCtx = HiveContext(sc)
#Create sample data
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
delta = timedelta(days=1)
futureTime = currTime + delta
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]
#parallelize the list and convert to dataframe
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")
# Aggregate in Spark SQL and write the result directly as a Hive table
aggRDD = hiveCtx.sql("select LastUpdate, Count(1) as Count from temp_t group by LastUpdate")
aggRDD.saveAsTable("Data_Count")
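To verify the result, you could then query the new table from the same HiveContext, for example:
# Optional check (sketch): read the saved table back and print a few rows.
hiveCtx.sql("select * from Data_Count").show()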