Purpose: access Hadoop data (Hive, HDFS) to enable big-data access and visual data processing with Kettle.
I. Environment setup and checks
1. Installing Sqoop
Command-line install: yum install sqoop
Related commands: yum info sqoop, yum help, yum remove sqoop
Install from the management console:
Click the services panel to the right of the cluster name and choose the Sqoop service (recommended: install it on a node other than the Inceptor node).
2. HDFS directory
hdfs dfs -mkdir /dwbi/basicDw/filename
Error: mkdir permission denied
Fix:
sudo -u hdfs hdfs dfs -chmod 777 /dwbi/basicDw
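A less permissive alternative (a sketch, assuming the import runs as the root user) is to create the directory as the hdfs superuser and hand ownership to that user instead of opening it up with chmod 777:
# create the target directory as the hdfs superuser if it does not exist yet
sudo -u hdfs hdfs dfs -mkdir -p /dwbi/basicDw
# give ownership to the user that runs the sqoop import (root here is an assumption)
sudo -u hdfs hdfs dfs -chown -R root:root /dwbi/basicDw
# verify the result
hdfs dfs -ls /dwbi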
3. Checking version information
rpm -qa | grep sqoop
rpm -qa|grep transwarp
II. Error log
1. When installing Sqoop, the chosen node had no hadoop-mapreduce directory (i.e. no YARN functionality), so the environment variables could not be configured; install on a node that does have YARN instead (yum install sqoop).
2.
.1 Problem during a Sqoop import: permission denied user=root, access=write, inode="/dwbi/basicDw"
Fix: sudo -u hdfs hdfs dfs -chmod 777 /dwbi/basicDw
Import the data:
sqoop import --connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl --username CDT_BI --password oracle --query "select * from drivinglicense where 1=1 and \$CONDITIONS" --target-dir /dwbi/basicDw/drivinglicense --outdir /tem/drivinglicense -m 1
View the result:
hdfs dfs -tail -f /dwbi/basicDw/drivinglicense/part-m-00000
hdfs dfs -cat /dwbi/basicDw/drivinglicense/part-m-00000
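Beyond tailing the file, a rough sanity check of the import can be done on the HDFS side (a sketch; the matching count on the Oracle side would be run separately):
# count the imported rows across all part files
hdfs dfs -cat /dwbi/basicDw/drivinglicense/part-m-* | wc -l
# show the total size and the part files produced by the mappers
hdfs dfs -du -h /dwbi/basicDw/drivinglicense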
.2 Importing a table directly into Hive with Sqoop:
sqoop import --hive-import --connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl --username CDT_BI --password oracle --verbose -m 1 --table ACD_FILE
Notes:
The username and table name must be uppercase, unless they were created as mixed-case names enclosed in quotes. Sqoop creates the table in Hive automatically from the Oracle table, so there is no need to create it by hand. The table is created in the default database; to import into another database, add --hive-table mydb.tbname (see the sketch below).
Reference: http://blog.csdn.net/lichangzai/article/details/19206479
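For example, a minimal sketch of importing into a non-default Hive database (the database name mydb and the Hive table name are illustrative):
sqoop import --hive-import \
  --connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl \
  --username CDT_BI --password oracle \
  --table ACD_FILE \
  --hive-table mydb.acd_file \
  -m 1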
3. Creating a Sqoop job
.1 Create statement
sqoop job --create drvSqoopJob \
-- import \
--connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl \
--username CDT_BI \
--password oracle \
--query "select * from drivinglicense where \$CONDITIONS" \
--target-dir /dwbi/basicDw/drivinglicense --outdir /tem/drivinglicense --split-by dabh -m 4 \
--null-string '\\N' --null-non-string '\\N'
.2 Related commands
sqoop job --list
sqoop job --show drvJob
sqoop job --delete drvJob
sqoop job --exec drvJob
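Saved jobs are typically run on a schedule, and saved options can be overridden at exec time; a minimal sketch (the override username, schedule and log path are assumptions):
# arguments after an extra -- override the values saved in the job, e.g. a different username
sqoop job --exec drvJob -- --username CDT_BI -P
# a crontab entry to run the saved job nightly at 01:00
0 1 * * * sqoop job --exec drvSqoopJob >> /var/log/sqoop/drvSqoopJob.log 2>&1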
.3 Problem: error reported when creating a Sqoop job
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation
Fix: mkdir /var/lib/accumulo
export ACCUMULO_HOME=/var/lib/accumulo
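To make the workaround survive new shell sessions, the variable can be persisted (a sketch; the profile file path is an assumption):
# persist the dummy ACCUMULO_HOME so every new shell picks it up
echo 'export ACCUMULO_HOME=/var/lib/accumulo' >> /etc/profile.d/sqoop-env.sh
source /etc/profile.d/sqoop-env.sh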
.4 Problem: error when creating a Sqoop job
Warning: /home/hadoop10/sqoop-1.4.5/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop10/sqoop-1.4.5/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: $HADOOP_HOME is deprecated
Fix:
Comment out the following two sections in sqoop/bin/configure-sqoop.
#if [ -z "${HCAT_HOME}" ]; then
#  if [ -d "/usr/lib/hive-hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hive-hcatalog
#  elif [ -d "/usr/lib/hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hcatalog
#  else
#    HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog
#    if [ ! -d ${HCAT_HOME} ]; then
#      HCAT_HOME=${SQOOP_HOME}/../hcatalog
#    fi
#  fi
#fi
#if [ -z "${ACCUMULO_HOME}" ]; then
#  if [ -d "/usr/lib/accumulo" ]; then
#    ACCUMULO_HOME=/usr/lib/accumulo
#  else
#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
#  fi
#fi
## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi
#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
.5 Problem: when creating a Sqoop job with
sqoop job --create drvJob \
--import \
--connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl \
--username cdt_bi \
--password oracle \
--table drivinglicense -m 4
the following error appears:
tool.BaseSqoopTool: Unrecognized argument: root
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: -password
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: root
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: --table
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: patient
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: -m
14/07/16 23:23:36 ERROR tool.BaseSqoopTool: Unrecognized argument: 1
14/07/16 23:23:36 ERROR tool.BaseSqoopTool:
Fix:
Change --import to -- import (sqoop job requires a space between -- and the tool name).
Reference: http://stackoverflow.com/questions/24787616/unable-to-create-job-in-sqoop
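With that fix applied, the command from .5 becomes (note the space after --):
sqoop job --create drvJob \
-- import \
--connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl \
--username cdt_bi \
--password oracle \
--table drivinglicense -m 4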
Lessons learned:
• The -m and --split-by options control, respectively, the concurrency with which Sqoop extracts data from the RDB and the column used to split the data. For example, -m 4 --split-by id splits the data into 4 shards by the id column and extracts them with 4 parallel maps.
• The bottleneck for the whole extraction is mainly in two places:
–the network bandwidth between the RDB and the Hadoop cluster
–the IO limits of the RDB
• Extraction performance can be estimated at roughly 5~10 MB/s per map.
• Note: more maps is not always better; too many maps cause IO contention on the RDB side and actually lower overall extraction throughput.
• In practice, increase the map concurrency gradually (see the sketch below).
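A minimal sketch of that incremental approach, reusing the earlier drivinglicense import (the candidate map counts, per-run target directories and use of time are assumptions):
# try increasing map counts step by step and compare the wall-clock time of each run
for maps in 2 4 8; do
  # clean any previous output for this run (directory names are illustrative)
  hdfs dfs -rm -r -f /dwbi/basicDw/drivinglicense_m"$maps"
  echo "import with -m $maps"
  time sqoop import \
    --connect jdbc:oracle:thin:@//10.0.0.0:1521/orcl \
    --username CDT_BI --password oracle \
    --query "select * from drivinglicense where \$CONDITIONS" \
    --split-by dabh -m "$maps" \
    --target-dir /dwbi/basicDw/drivinglicense_m"$maps" \
    --outdir /tem/drivinglicense
done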
.6 To access HDFS files from Kettle you need to define a Hadoop cluster, with the following settings:
cluster name: anything you like
HDFS hostname: the cluster's active management (NameNode) node, port: 8020
username and password: leave blank (if Kerberos is enabled on the cluster, fill in the corresponding HDFS user and password)
JobTracker hostname: the IP of the ResourceManager node, port: 8032
ZooKeeper hostname: the IP of the node running that service, port: 2181
Oozie: leave blank
(Note: odd problems during testing could sometimes be resolved by switching to a different Kettle version.)
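Before suspecting Kettle itself, it can help to confirm from the Kettle host that the relevant ports are reachable (a sketch; replace 33.0.0.0 with the actual NameNode, ResourceManager and ZooKeeper hosts):
# NameNode RPC, ResourceManager and ZooKeeper ports used in the Hadoop cluster definition
nc -zv 33.0.0.0 8020
nc -zv 33.0.0.0 8032
nc -zv 33.0.0.0 2181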
.7 Kettle: field types
Problem: in Kettle 5.1 a table input must not contain varchar columns, otherwise errors such as "kettle cancel operation failed", "out of sequence response" and "type unset" appear.
.8 Kettle: Hadoop file output
Problem: when using the Hadoop file output step (no username or password is needed in the configuration; just fill in the server, 33.0.0.0, and the port, 8020), fetching the fields directly and then running the transformation fails with OutOfMemoryError: Requested array size exceeds VM limit.
Fix: shorten the lengths on the Fields tab; the defaults produced by fetching the fields directly are far too long.
.9 Hive: garbled Chinese characters
Problem: Chinese characters are garbled in Hive query results.
Fix: ALTER TABLE DRV_FACT_ZJCXSL_EX SET SERDEPROPERTIES ('serialization.encoding'='GBK');
.10 Hive: garbled Chinese characters (continued)
1. First convert the source file to UTF-8 with iconv, or
2. add SERDEPROPERTIES ('serialization.encoding'='gbk') when creating the table (a sketch of option 1 follows).
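A minimal sketch of option 1 (the file names are illustrative); for an existing table, the ALTER TABLE statement from .9 covers option 2:
# convert a GBK-encoded source file to UTF-8 before loading it into HDFS/Hive
iconv -f GBK -t UTF-8 -o source_utf8.txt source_gbk.txt
hdfs dfs -put source_utf8.txt /dwbi/basicDw/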
To be continued!!!
华青莲's daily notes, kept for my own growth and to help others.