How does Vertica handle semi-structured data, even when it is loaded from different file formats?
My understanding of semi-structured data handling in Vertica is that if the data looks like this (in JSON):
{
"f1":1,
"f2":"hello",
"f3":false,
"f4":2
}
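then a flex table is created with two columns, __identity__ and __raw__. __identity__ will have 4 fields (I assume the integers 1, 2, 3, 4), and __raw__ will hold the raw representation of the data (1, hello, false and 2).
I could also load data from a CSV file into the same flex table, for example 2, hello2, true, 3. How does Vertica decide which field maps to which column, and that, for example, f1 and f4 are int?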
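Well, there's nothing like having a Vertica SQL prompt at hand (and the privileges to create database objects...) to try it and find out.
With JSON, the field names are part of the structure: key-value pairs.
With CSV, the first line of the data file needs to contain the column names - I added that line below...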
-- connecting with VSQL,
$ vsql -h localhost -d sbx -U dbadmin -w pwd
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
sbx=> -- create the flex table
sbx=> CREATE FLEX TABLE flx();
CREATE TABLE
sbx=> -- load the flex table from stdin - data handed in-line - using your input
sbx=> COPY flx FROM stdin PARSER fjsonparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {
>> "f1":1,
>> "f2":"hello",
>> "f3":false,
>> "f4":2
>> }
>> \.
-- test the load ...
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+-------+-------+----
1 | hello | false | 2
sbx=>-- load the CSV file - note that we need the title line,
sbx=>-- which I add, to have same values in the same fields
sbx=> COPY flx FROM stdin PARSER fcsvparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> f1,f2,f3,f4
>> 2, hello2, true, 3
>> \.
sbx=>-- check the contents now
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+--------+-------+----
1 | hello | false | 2
2 | hello2 | true | 3
sbx=>-- resulting table definition in catalog ...
sbx=> \d flx
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
---------+-------+--------------+------------------------+--------+---------+----------+-------------+-------------
dbadmin | flx | __identity__ | int | 8 | | t | f |
dbadmin | flx | __raw__ | long varbinary(130000) | 130000 | | t | f |
(2 rows)
sbx=> -- check the contents of __identity__ and (after visualising) __raw__
sbx=> SELECT __identity__,REPLACE(MAPTOSTRING(__raw__),CHR(10),' ') FROM flx;
__identity__ | REPLACE
--------------+------------------------------------------------------------------------
1 | { "f1": "1", "f2": "hello", "f3": "false", "f4": "2" }
2 | { "f1": "2", "f2": "hello2", "f3": "true", "f4": "3" }
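The session above suggests that __identity__ is just a per-row number and that __raw__ keeps every value as a string, whichever parser loaded it, so nothing is typed as int at load time. As a minimal sketch of how to get at types afterwards, assuming the same flx table and that your Vertica version ships the flex helper function COMPUTE_FLEXTABLE_KEYS (which writes its guesses to a companion table named flx_keys), something like the following could be tried. No output is shown because the guessed types depend on the data and the version.

```sql
-- Scan __raw__ and record each key, its frequency and a guessed
-- data type in the companion table flx_keys.
SELECT COMPUTE_FLEXTABLE_KEYS('flx');

-- Inspect the guesses; these are suggestions, not enforced column types.
SELECT key_name, frequency, data_type_guess
FROM flx_keys
ORDER BY key_name;

-- Virtual columns come back as strings, so cast explicitly
-- wherever the query needs a real type.
SELECT f1::INT + f4::INT AS f1_plus_f4
FROM flx;
```

The cast in the last query is the part that decides the type: the flex table itself only maps key names to string values, which is why the JSON and CSV rows can sit side by side as long as their keys match.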