Hive DDL Operations

2018-08-02 05:57:19  Source: 博客园 (cnblogs)

DDL (Data Definition Language) Operations

　　Hive ships with one built-in database named default.

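　　　　create database [if not exists] <database name>; -- create a database

　　　　show (databases|schemas); -- list all databases

　　　　drop database if exists <database name> [restrict|cascade]; -- drop a database. By default Hive refuses to drop a database that still contains tables; drop the tables first, otherwise the statement fails.
　　　　-- Adding the cascade keyword force-drops the database together with its tables. The default is restrict, which applies the limitation above.
　　　　　　e.g. hive> drop database if exists users cascade;

　　　　use <database name>; -- switch to a database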

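　　Show commands

　　　　show tables; -- list all tables in the current database

　　　　show partitions table_name; -- list the table's partitions; fails if the table is not partitioned

　　　　show functions; -- list all functions supported by the current Hive version

　　　　desc extended table_name; -- show table details

　　　　desc formatted table_name; -- show table details (nicely formatted)

　　　　describe database database_name; -- show information about a database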

　　Creating tables

　　　　create [external] table [if not exists] table_name
　　　　　　[(col_name data_type [comment col_comment], ...)]
　　　　　　[comment table_comment]
　　　　　　[partitioned by (col_name data_type [comment col_comment], ...)]
　　　　　　[clustered by (col_name, col_name, ...)
　　　　　　[sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
　　　　　　[row format row_format]
　　　　　　[stored as file_format]
　　　　　　[location hdfs_path]

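　　　　Notes:

　　　　　　1. create table creates a table with the given name. If a table with the same name already exists, an exception is thrown; the if not exists option suppresses that exception.

　　　　　　2. The external keyword creates an external table, and the location clause given at creation time points to where the data actually lives.

　　　　　　For a managed (internal) table, Hive moves the data into the path under the data warehouse directory. For an external table, Hive only records the path of the data and leaves the data where it is. When the table is dropped, a managed table loses both its metadata and its data, whereas an external table loses only its metadata and the data is kept.

　　　　　　3. like copies the structure of an existing table without copying its data.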

　　　　　　create [external] table [if not exists] [db_name.]table_name like existing_table;

　　　　　　4. row format delimited (specifies the delimiters)

　　　　　　　　[fields terminated by char]

　　　　　　　　[collection items terminated by char]

　　　　　　　　[map keys terminated by char]

　　　　　　　　[lines terminated by char] | serde serde_name

　　　　　　　　[with serdeproperties(property_name=property_value, property_name=property_value,...)]

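　　　　　　Hive's default field delimiter at table creation is '\001'. If no delimiter is specified when the table is created, the files you load must use '\001' as their delimiter. If a file uses a different delimiter, Hive reports no error, but the fields cannot be split and queries on the table return NULL for them.

　　　　　　In the vi editor, press Ctrl+v and then Ctrl+a to type '\001'; it is displayed as ^A.

　　　　　　SerDe is short for Serializer/Deserializer; it handles serialization and deserialization.

　　　　　　How Hive reads a file: it first calls the InputFormat (TextInputFormat by default), which returns records one at a time (by default one line is one record). It then calls the Deserializer of the SerDe (LazySimpleSerDe by default) to split each record into fields (on '\001' by default).

　　　　　　How Hive writes a file: to write a Row, it calls the SerDe's Serializer and the OutputFormat, in the reverse of the read order. Use desc formatted table_name; to inspect these settings. When your data format is unusual, you can write a custom SerDe.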
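
To make the delimiter mismatch described above concrete, here is a small Python sketch. It is a simplified model of delimiter-based splitting, not Hive's actual LazySimpleSerDe code; the two-column (id int, name string) table and the helper names are assumptions made for illustration.

```python
# Simplified model of delimiter-based field splitting (not Hive's real SerDe).
DEFAULT_DELIM = "\x01"  # the '\001' character, shown as ^A in vi


def split_fields(line, num_columns, delim=DEFAULT_DELIM):
    """Split one text record into exactly num_columns fields, padding with None."""
    parts = line.split(delim)[:num_columns]
    return parts + [None] * (num_columns - len(parts))


def to_int(text):
    """Return an int, or None when the text is missing or not numeric."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def read_row(line, delim=DEFAULT_DELIM):
    """Read a row of a hypothetical (id int, name string) table."""
    raw_id, name = split_fields(line, 2, delim)
    return (to_int(raw_id), name)


good = read_row("1\x01zhangsan")  # file uses the default delimiter
bad = read_row("1,zhangsan")      # file uses ',' but the table expects '\001'
print(good)
print(bad)
```

In this model the mismatched line is never split, so the int column cannot be parsed and the second column is missing, which mirrors the all-NULL result the text warns about.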

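　　　　　　5. partitioned by (partitioning)

　　　　A Hive select normally scans the whole table, which wastes a lot of time on unnecessary work. Often only a subset of the data is of interest, which is why the partition concept was introduced at table creation.

　　　　A partitioned table is one whose partition space is declared when the table is created. A table can have one or more partitions, and each partition exists as its own folder under the table's directory. Table and column names are case-insensitive. A partition appears as a column in the table schema, and the describe table_name command lists it, but that column stores no actual data; it only identifies the partition.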
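
The folder-per-partition layout can be sketched in a few lines of Python. This only illustrates the key=value directory naming; the warehouse path and the demo.db database name are hypothetical, and the function is not part of any Hive API.

```python
def partition_path(table_dir, partition_spec):
    """Build a partition directory: one key=value folder per partition column."""
    folders = [f"{col}={value}" for col, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + folders)


# Hypothetical table directory under the warehouse
table_dir = "/user/hive/warehouse/demo.db/day_hour_table"

single = partition_path(table_dir, [("dt", "2017-07-07")])
nested = partition_path(table_dir, [("dt", "2017-07-07"), ("hour", "08")])
print(single)
print(nested)
```

A query that filters on the partition column only needs to read the matching folder, which is why partitioning narrows the scan.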

　　　　　　6. stored as sequencefile|textfile|rcfile

　　　　If the data is plain text, use stored as textfile. If the data needs to be compressed, use stored as sequencefile.

　　　　textfile is the default file format; use the delimited clause to read delimited files.

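　　　　　　7. clustered by ... into num_buckets buckets (bucketing)

　　　　Each table or partition can be further organized into buckets, a finer-grained division of the data. Buckets are organized on a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into.

　　　　There are two reasons to organize a table (or partition) into buckets:

　　　　(1) More efficient query processing. Buckets add extra structure to a table that Hive can exploit for some queries. In particular, joining two tables that are bucketed on the same columns (which include the join columns) can be done efficiently as a map-side join. For a join on a column that both tables share, if both tables are bucketed on it, only the buckets holding the same column values need to be joined, which greatly reduces the amount of data the join touches.

　　　　(2) More efficient sampling. When working with large data sets, being able to trial-run a query on a small part of the data while developing and refining it is very convenient.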
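
The hash-then-remainder rule can be sketched in Python as follows. This is a minimal model that assumes an integer bucketing column whose hash is the value itself; Hive uses its own per-type hash functions, and the sample rows are made up.

```python
def bucket_for(value, num_buckets):
    """Pick a bucket: hash the value, then take the remainder by the bucket count."""
    # Assumption: an int column hashes to itself; other types hash differently.
    non_negative_hash = value & 0x7FFFFFFF
    return non_negative_hash % num_buckets


def bucketize(rows, key_index, num_buckets):
    """Group rows into num_buckets lists according to the bucketing column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


# Hypothetical (Sno, Sname) rows
students = [(95001, "a"), (95002, "b"), (95005, "c"), (95006, "d")]
buckets = bucketize(students, 0, 4)
for number, rows in enumerate(buckets):
    print(number, rows)
```

Because the bucket number depends only on the key and the bucket count, two tables bucketed the same way put equal keys into buckets with the same number, which is what lets a join pair up buckets instead of whole tables.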

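　　For a managed (internal) table to map onto a file, the file must sit under the designated path:

　　/user/hive/warehouse/xxx.db
　　The mapping also requires specifying the delimiter that the file actually uses:

　　row format delimited fields terminated by ","
　　Finally, the column order and types declared for the table must match those in the structured file.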

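　　Local mode

　　Sometimes the input to a Hive query is very small. In that case, the time spent launching the tasks for the query can far exceed the actual execution time of the job. For most such cases Hive can run all tasks on a single machine in local mode, which noticeably shortens execution time for small data sets.

Operations on small amounts of data then run locally, which is much faster than submitting them to the cluster. Enable it with:

set hive.exec.mode.local.auto=true;

　　A job actually runs in local mode only if it meets all of the following conditions:

1. The job's input size must be smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)

2. The job's number of map tasks must be smaller than hive.exec.mode.local.auto.tasks.max (default 4)

3. The job's number of reduce tasks must be 0 or 1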
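
The three conditions combine into a single check, sketched here in Python. The function name is hypothetical and the defaults simply restate the values quoted above; this is not how Hive implements the decision internally.

```python
# Defaults as quoted in the text above
DEFAULT_MAX_INPUT_BYTES = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
DEFAULT_MAX_TASKS = 4                        # hive.exec.mode.local.auto.tasks.max


def qualifies_for_local_mode(input_bytes, num_maps, num_reduces,
                             max_input_bytes=DEFAULT_MAX_INPUT_BYTES,
                             max_tasks=DEFAULT_MAX_TASKS):
    """Return True only when all three local-mode conditions hold."""
    return (
        input_bytes < max_input_bytes   # 1. input is small enough
        and num_maps < max_tasks        # 2. few enough map tasks
        and num_reduces in (0, 1)       # 3. at most one reducer
    )


MB = 1024 * 1024
print(qualifies_for_local_mode(10 * MB, 2, 1))
print(qualifies_for_local_mode(200 * MB, 1, 0))
```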

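　Hive delimiters:

　　　　row format delimited (Hive's built-in delimiter handling) | serde (a custom or other SerDe class)

create table day_table (id int, content string) partitioned by (dt string) row format delimited fields terminated by ','; -- create a partitioned table with an explicit delimiter

　　　Specifying delimiters for tables with complex types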

create table complex_array(name string,work_locations array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';

Data:
1,zhangsan,singing:love-dancing:like-swimming:so-so
2,lisi,gaming:love-basketball:dislike

create table t_map(id int,name string,hobby map<string,string>)
    row format delimited 
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':' ;
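
To show how the three delimiters in t_map divide one line of text, here is a Python sketch that parses a sample row of the same shape as the data above. It is only a model of the splitting order (fields, then collection items, then map keys); the function name and the sample row are illustrative.

```python
def parse_t_map_row(line, field_delim=",", item_delim="-", key_delim=":"):
    """Split a line into (id, name, hobby) using the three delimiter levels."""
    raw_id, name, raw_hobby = line.split(field_delim, 2)  # fields terminated by ','
    hobby = {}
    for item in raw_hobby.split(item_delim):               # collection items terminated by '-'
        key, value = item.split(key_delim, 1)               # map keys terminated by ':'
        hobby[key] = value
    return int(raw_id), name, hobby


row = parse_t_map_row("2,lisi,gaming:love-basketball:dislike")
print(row)
```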

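　　Partitioned tables (PARTITIONED BY) (creates subfolders)

　　　　There are two kinds of partitioned tables. A single-partition table has only one level of folders under the table directory. A multi-partition table has nested folders under the table directory.

　　　　A partition column is not a real column of the table but a virtual one (its value only identifies the partition).

　　　　A partition column must not reuse the name of a column already in the table; otherwise compilation fails.

　　　 Single-partition create statement:

　　　create table day_table (id int, content string) partitioned by (dt string); -- single-partition table, partitioned by day; the schema shows three columns: id, content, dt

　　　　Loading data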

     LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_table partition(dt='2017-07-07');

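　　　　Two-level partition create statement:

create table day_hour_table (id int, content string) partitioned by (dt string, hour string); -- two-level partitioned table, partitioned by day and hour; dt and hour are added to the schema as two extra columns

　　　　Loading data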

LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_hour_table PARTITION(dt='2017-07-07', hour='08');

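　　　　Querying by partition:

SELECT day_table.* FROM day_table WHERE day_table.dt = '2017-07-07';

　　　　Listing partitions:

show partitions day_hour_table;

　　　　　　In short, partitions assist queries: they narrow the range a query has to scan, speed up data retrieval, and let you manage data by well-defined rules and conditions.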

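　　Bucketed tables (clustered by ... into num_buckets buckets)

A bucketed table divides the data more finely at the file level.
A bucketed table definition must name the column to bucket on.
The number of buckets declared on the table should match the number you configure when loading.
Bucketing improves join efficiency by reducing the number of row pairs that must be compared (the Cartesian product).

　　　　　　# Enable bucketing (the data is split into that many files)

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4; -- same as the bucket count; a smaller value does not break bucketing but slows the load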

　　　　TRUNCATE TABLE stu_buck;

　　　　drop table stu_buck;

create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) 
into 4 buckets
row format delimited
fields terminated by ',';

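　　Loading data into a bucketed table

insert overwrite table stu_buck
select * from student cluster by(Sno); -- load the bucketed table by querying a staging table; the two tables should have the same structure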

----------------------------------------------------------------

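　　Bucketing and sorting in queries: cluster by, sort by, distribute by

select * from student cluster by(Sno);

insert overwrite table student_buck
select * from student cluster by(Sno) sort by(Sage); -- fails: cluster by and sort by cannot be used together

　　To bucket on one column while sorting on another:

insert overwrite table stu_buck
select * from student distribute by(Sno) sort by(Sage asc);

　　distribute by splits the data into groups on the given column; the number of groups is determined by set mapreduce.job.reduces=?

　　When the distribution column and the sort column differ, use distribute by(xxx) sort by(yyy).

　　order by sorts globally on the given column. Whatever value mapreduce.job.reduces is set to, the job runs with a single reducer, because only when all the data reaches one reduce task (and thus one file) can the order be global.

　　Summary:
　　　　cluster by (distribute and sort on the same column) == distribute by (distribute) + sort by (sort), where the two columns may differ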
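
The relationship in the summary can be modelled in Python. This sketch assumes integer keys that are assigned to reducers by a plain remainder and uses made-up rows; it imitates the semantics only and is not how Hive executes these clauses.

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reducers):
    """Model of distribute by + sort by: route rows by key, then sort per reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys, assigned by remainder
        reducers[row[dist_key] % num_reducers].append(row)
    # sort by orders rows inside each reducer only, not globally
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]


def cluster_by(rows, key, num_reducers):
    """Model of cluster by: distribute and sort on the same column."""
    return distribute_and_sort(rows, key, key, num_reducers)


# Hypothetical student rows
students = [
    {"Sno": 4, "Sage": 20},
    {"Sno": 2, "Sage": 22},
    {"Sno": 1, "Sage": 19},
    {"Sno": 3, "Sage": 21},
    {"Sno": 6, "Sage": 18},
]
by_age = distribute_and_sort(students, "Sno", "Sage", 2)
by_sno = cluster_by(students, "Sno", 2)
print([[r["Sno"] for r in part] for part in by_age])
print([[r["Sno"] for r in part] for part in by_sno])
```

Each inner list is one reducer's output file: the rows are the same in both results, and only the order within each file changes with the sort column.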

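　　Managed (internal) and external tables

　　　　Creating a managed table (to map a file, upload it to the table's fixed HDFS path; dropping the table removes not only the table metadata but also the files under Hive's HDFS path)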

create table student(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

　　　　Creating an external table (the file can be anywhere in HDFS, but its path must be given with location)

create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' location '/stu';

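　　　　Loading data into managed and external tables:

　　　　　　With local, the data comes from the local file system (that of the machine running the hiveserver2 service).

　　　　　　Without local, the data is somewhere in the HDFS distributed file system.

  - If the data is local, load is a pure copy operation
  - If the data is in HDFS, load is a move operation

load data local inpath '/root/hivedata/students.txt' overwrite into table student; -- local refers to the server running hiveserver, not the machine running the hive CLI or beeline client (in production the two are not on the same machine)

load data inpath '/stu' into table student_ext; -- a file in HDFS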

Altering tables

　　　　Adding partitions:

　　　　　　alter table table_name add partition (dt='20170101') location '/user/hadoop/warehouse/table_name/dt=20170101'; -- add one partition

　　　　　　alter table table_name add partition (dt='2008-08-08', country='us') location '/path/to/us/part080808' partition (dt='2008-08-09', country='us') location '/path/to/us/part080809'; -- add several partitions at once

　　　　Dropping partitions

　　　　　　alter table table_name drop if exists partition (dt='2008-08-08');

　　　　　　alter table table_name drop if exists partition (dt='2008-08-08', country='us');

　　　　Renaming a partition

　　　　　　alter table table_name partition (dt='2008-08-08') rename to partition (dt='20080808');

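　　　　Adding columns

　　　　　　alter table table_name add|replace columns (col_name string);

　　　　Note: add appends a new column after all existing columns (but before the partition columns).

　　　　　　replace replaces all of the table's columns.

　　　　Changing a column

　　　　create table test_change (a int, b int, c int);

　　　　　　alter table test_change change a a1 int; -- rename column a to a1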