Hive分区表实战

发表: 2016-12-29 浏览: 4596

Hive

1. Hive分区表

Partition和Bucket，为了提升查询效率，前者是粗粒度的划分，后者是细粒度的划分。
建表语句中使用partitioned by指定分区字段
分区表有静态分区和动态分区两种。若分区的值是确定的，那么称为静态分区字段，反之，若分区的值是非确定的，那么称之为动态分区字段。默认是采用静态分区。

2. 静态分区

应用场景1

每天有很多不同的商店各自会产生成百上千的销售记录，当天的数据当天加载。那么这是一个最简单的示例，每天产生的这些销售记录中，某些客户的销售额如何插入到已有的数仓的表中去。日期是确定的。

create external table if not exists sp_trans(
cust_no string comment '客户号',
shop_no string comment '商铺号',
trans_amt double comment '交易额',
trans_date string comment '交易日期',
etl_ts timestamp comment 'etl时间戳')
partitioned by (dt string)
row format delimited fields terminated by ',';
/* 将数据插入到某一个确定的日起分区中去 */
insert into sp_trans partition (dt='2016-01-13')
select * from transaction where trans_date='2016-01-13';

应用场景2

加入一个分区字段，商铺号。分区字段之间具有层级关系。

create external table if not exists sp_trans2(
cust_no string comment '客户号',
shop_no string comment '商铺号',
trans_amt double comment '交易额',
trans_date string comment '交易日期',
etl_ts timestamp comment 'etl时间戳')
partitioned by (dt string,shop string)
row format delimited fields terminated by ',';
insert into sp_trans2 partition (dt='2016-01-13',shop='9502')
select * from transaction where trans_date='2016-01-13' and shop_no='9502';
insert into sp_trans2 partition (dt='2016-01-13',shop='9507')
select * from transaction where trans_date='2016-01-13' and shop_no='9507';
insert into sp_trans2 partition (dt='2016-01-14',shop='9502')
select * from transaction where trans_date='2016-01-14' and shop_no='9502';

应用场景3

每天按照不同的商铺对交易额进行统计，这里统计的实体是商铺，同样的，日期是固定的。

create external table if not exists sp_shop_daily(
shop_no string,
trans_sum double comment '交易额统计',
etl_ts timestamp)
partitioned by (shop string,dt string)
row format delimited fields terminated by ',';
/*插入汇总后的数据到最终的统计结果所指定的日期分区中去*/
insert into sp_shop_daily partition (shop='9502',dt='2016-01-13')
select shop_no,sum(trans_amt),current_timestamp() from transaction where shop_no='9502' and trans_date='2016-01-13' group by shop_no;

2. 动态分区

应用场景1

每天有很多不同的商店各自会产生成百上千的销售记录，当天的数据当天加载。那么这是一个最简单的示例，每天产生的这些销售记录中，某些客户的销售额如何插入到已有的数仓的表中去。日期是非确定的。
动态分区表的插入原则，分区字段的取值字段放在select子句的最后面

create external table if not exists dp_trans(
cust_no string comment '客户号',
shop_no string comment '商铺号',
trans_amt double comment '交易额',
trans_date string comment '交易日期',
etl_ts timestamp comment 'etl时间戳')
partitioned by (dt string)
row format delimited fields terminated by ',';

insert into dp_trans partition (dt)
select *,trans_date from transaction;

Error 1
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
set hive.exec.dynamic.partition.mode=nonstrict;
Error2
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to 100 partitions per node, number of dynamic partitions on this node: 101
set hive.exec.max.dynamic.partitions.pernode=2000;

应用场景2

加入一个分区字段，商铺号。分区字段之间具有层级关系。

create external table if not exists dp_trans2(
cust_no string comment '客户号',
shop_no string comment '商铺号',
trans_amt double comment '交易额',
trans_date string comment '交易日期',
etl_ts timestamp comment 'etl时间戳')
partitioned by (dt string,shop string)
row format delimited fields terminated by ',';

insert into dp_trans2 partition (shop,dt)
select *,shop_no,trans_date from transaction;

应用场景3

每天按照不同的商铺对交易额进行统计，这里统计的实体是商铺，同样的，日期是非固定的。

create external table if not exists dp_shop_daily(
shop_no string,
trans_sum double comment '交易额统计',
etl_ts timestamp)
partitioned by (shop string,dt string)
row format delimited fields terminated by ',';

insert into dp_shop_daily partition(shop,dt)
select shop_no,sum(trans_amt),current_timestamp(),shop_no,trans_date from transaction group by shop_no,trans_date;

注意，当分区字段包含动态和静态分区字段两种时，动态分区字段不能作为静态分区字段的父级

3. 修改分区

1. 添加分区

alter table dp_trans2 add partition (dt='2016-01-13',shop='001');

2. 重命名

alter table sp_trans2 partition (dt='2016-01-13',shop='001') rename to partition (dt='2016-01-13',shop='000');

3. 交换分区

alter table sp_trans2 exchange partition (dt='2016-01-13',shop='9510') with table dp_trans2;

4. 恢复分区

hive> msck repair table sp_trans2;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

在Hive v2.1.1修复了此问题，可下载最新版本使用

5. 删除分区

alter table sp_trans2 drop partition(dt='2016-01-13',shop='9510');

transaction.tar_.gz

8 个评论

wangguow

Mars 有课程笔记啊，看来课程有更新啊。

Jason_Huang

支持Mars

北丐

transaction表的数据给上传一下呗

MarsJ 回复北丐

已上传

北丐回复 MarsJ

感谢

祁小姐

谢谢MarsJ

城市无人呐喊

Mars您好，请问load加载数据时可否实现动态分区。

MarsJ 回复城市无人呐喊

你是指的直接用load data语句操作的时候吗？还是其他方式load?

要回复文章请先登录或注册

文章目录