Hive complex data types
- Array: ARRAY<data_type>, an ordered collection; every element must be of the same data type, and elements are indexed from 0.
- Map: MAP<primitive_type,data_type>, a set of key-value pairs, e.g. map<string,float>; keys must be primitive types.
- Struct: STRUCT<col_name1:data_type,col_name2:data_type,...>, a group of named members that may have different types.
- Union: UNIONTYPE<data_type,data_type,...>. Support for UNIONTYPE is still incomplete: queries that apply clauses such as JOIN, WHERE, or GROUP BY to union columns will raise errors.
1. Basic use of Hive's three complex types
To type the default delimiters in a terminal or vi: Ctrl+V Ctrl+A produces \001, Ctrl+V Ctrl+B produces \002, and Ctrl+V Ctrl+C produces \003.
create external table if not exists emp(
id int comment 'employee id',
name string comment 'employee name',
salary float comment 'employee salary',
family array<string> comment 'names of family members',
deductions map<string,float> comment 'tax deduction items',
address struct<city:string,street:string,zip:int> comment 'address (city, street, zip code)'
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;
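As a hypothetical sample, one line of a data file for emp could look as follows, writing ^A, ^B, ^C for \001, \002, \003 (all names and figures are made up for illustration):

```text
1^AJohn^A50000.0^AAlice^BBob^ATax^C0.2^BInsurance^C0.05^Abeijing^Bchangan^B100000
```

Here ^A separates the six fields; ^B separates array elements, map entries, and struct members; ^C separates each map key from its value.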
create external table if not exists emp1(
id int comment 'employee id',
name string comment 'employee name',
salary float comment 'employee salary',
family array<string> comment 'names of family members',
deductions map<string,float> comment 'tax deduction items',
address struct<city:string,street:string,zip:int> comment 'address (city, street, zip code)'
)
row format delimited
fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';
load data local inpath '/var/root/complex/emp1.txt' into table emp1;
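A hypothetical emp1.txt matching emp1's delimiters (fields separated by ',', collection items by ' ', key from value by ':'):

```text
1,John,50000.0,Alice Bob,Tax:0.2 Insurance:0.05,beijing changan 100000
```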
Operations on tables with complex-type columns:
select family[2] from emp1;                      -- third element (arrays are 0-indexed)
select size(family) from emp1;                   -- number of elements
select array_contains(family,'Miao') from emp1;  -- true if 'Miao' is an element
select sort_array(family) from emp1;             -- elements in ascending order
select deductions["Tax"] from emp1;              -- value for key "Tax" (NULL if absent)
select size(deductions) from emp1;               -- number of key-value pairs
select map_keys(deductions) from emp1;           -- array of the map's keys
select map_values(deductions) from emp1;         -- array of the map's values
select address.city,address.zip from emp1;       -- struct members via dot notation
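Besides direct indexing, a LATERAL VIEW with explode() flattens a collection into one row per element; a minimal sketch against emp1 (the column aliases are arbitrary):

```sql
-- one row per family member
select name, member
from emp1 lateral view explode(family) f as member;

-- one row per deduction entry (a map explodes into two columns)
select name, item, rate
from emp1 lateral view explode(deductions) d as item, rate;
```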
2. Nesting Hive's three complex types
1. ARRAY nested with STRUCT
create external table if not exists emp2(
id int,
name string,
salary float,
info array<struct<city:string,street:string,zip:int>>
)
row format delimited
fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';
load data local inpath '/var/root/complex/emp2.txt' into table emp2;
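Assuming the nested struct's members fall to the third delimiter level (':' here), a hypothetical emp2.txt line and a query against it might be:

```text
1,John,50000.0,beijing:changan:100000 shanghai:nanjing:200000
```

```sql
-- city of the first address in the array
select info[0].city from emp2;
```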
2. STRUCT nested with MAP
create external table if not exists emp3(
id int,
name string,
info struct<
salary:float,
deductions:map<string,float>>
)
row format delimited
fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';
load data local inpath '/var/root/complex/emp3.txt' into table emp3;
For emp3's layout, a separator is needed at four levels:
1. between the fields themselves;
2. between the struct's members, i.e. between salary and deductions;
3. between the key-value pairs of deductions, i.e. between Tax:.2 and Insurance:.05;
4. between the key and value of each pair, e.g. within Tax:.2.
Since ROW FORMAT DELIMITED only lets you customize the first three delimiters, emp4 below relies on the defaults:
create external table if not exists emp4(
id int,
name string,
info struct<
salary:float,
deductions:map<string,float>>
);
The default delimiters by nesting level:
Level 1: \001
Level 2: \002
Level 3: \003
Level 4: \004
load data local inpath '/var/root/complex/emp4.txt' into table emp4;
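In the same caret notation (^A = \001 through ^D = \004), a hypothetical emp4.txt line would be:

```text
1^AJohn^A50000^BTax^D0.2^CInsurance^D0.05
```

^A splits the three fields, ^B splits the struct members salary and deductions, ^C splits the two key-value pairs, and ^D splits each key from its value.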
3. ARRAY, MAP, and STRUCT nested together
create external table if not exists emp5(
id int,
name string,
info struct<
family:array<map<string,string>>,
address:struct<city:string,street:string,zip:int>
>
);
Here a separator is needed at five levels:
1. between the fields;
2. between info's members, i.e. between family and address;
3. between family's elements, and between address's members;
4. between the key-value pairs within each family element;
5. between the key and value of each pair in a family element.
load data local inpath '/var/root/complex/emp5.txt' into table emp5;
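Continuing the notation through ^E = \005, a hypothetical emp5.txt line with a one-element family array, plus a query that reaches into all three types:

```text
1^AJohn^Amother^Emary^Dfather^Etom^Bbeijing^Cchangan^C100000
```

```sql
select info.family[0]['mother'], info.address.city from emp5;
```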
Conclusion: once nesting goes deeper than three levels, custom delimiters can no longer cover every level; the data gets split at the wrong places and values shift into the wrong columns. At that point only the default delimiters can handle the deeper levels. In practice, of course, such deeply nested types are rarely designed.
With LOAD DATA you must produce the delimiters in the data file yourself. In most ETL work, however, data is loaded with INSERT INTO ... SELECT, and the delimiters are then written by Hive according to the table definition.
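A sketch of such an insert against emp, building the complex values with Hive's array(), map(), and named_struct() constructors; all values here are made up:

```sql
insert into table emp
select 2, 'Mary', 60000.0,
       array('Tom', 'Jerry'),                          -- family
       map('Tax', cast(0.2 as float),
           'Insurance', cast(0.05 as float)),          -- deductions
       named_struct('city', 'shanghai',
                    'street', 'nanjing',
                    'zip', 200000);                    -- address
```

Hive then serializes these values with the delimiters from emp's table definition, so no hand-crafted delimiter file is needed.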
Data files: