本文讲述下远程连接Hive遇到的一些问题
一、CentOS连接Hive
尝试使用CentOS 7远程连接Hive,首先安装好需要的相关包
pip install pyhive
pip install thrift
yum install cyrus-sasl-devel.x86_64
pip install sasl
pip install thrift-sasl==0.2.1
1.连接Hive
$python
>>> from pyhive import hive
>>> conn = hive.Connection(host='192.168.12.5', port=10000, username='root', database='behavior_labels')
>>> cur=conn.cursor()
>>> cur.execute('SHOW TABLES')
>>> cur.fetchall()
另外:当使用create_engine连接hive时需要安装pyhive,如果报错ImportError: cannot import name TFrozenDict时,升级thrift版本。
目前pandas使用to_sql存入数据到hive只能存入一行就失败。
from sqlalchemy import create_engine
engine = create_engine('hive://root@192.168.12.67:10000/item_recommand')
data.to_sql('user', con=engine,index=False, if_exists='append')#只能存入一行数据
from impala.dbapi import connect
from impala.util import as_pandas
conn = connect(host='192.168.12.67', port=10000, auth_mechanism='PLAIN', user='root', password='20150228', database='item_recommand')
cur=conn.cursor()
cur.execute("insert into user values('4','1',1,'5','2'),('5','1',1,'5','2')")#hive版本1.1,插入速度很慢
2.插入数据
读取数据还OK,但是集群中用pyspark插入数据的话,速度很慢,而且容易锁表,还是在集群中生成临时表,然后插入hive比较合适
from pyspark import SparkContext,SparkConf
from pyspark.sql import HiveContext
conf = SparkConf().setAppName(u'recommend')#.setMaster("local[*]")
#sparkContext = SparkContext(conf=conf)
sc= SparkContext.getOrCreate()
hive_context = HiveContext(sc)
hive_context.sql('use recommand')
df=hive_context.createDataFrame([(1,2,3),(4,5,6)],['vipid','recommand','rank'])
df.registerTempTable("df")
hive_context.sql('insert into table itemcf select * from df')
二、Windows连接Hive
1.使用pyhive连接hive
(连接未成功,可跳过)
安装好连接hive所需的包
pip install pyhive
pip install thrift
pip install sasl #此步需要先安装visualcppbuildtools_full.exe,不成功的话下载https://www.lfd.uci.edu/~gohlke/pythonlibs/的包
pip install thrift_sasl
C:\Users\Administrator> python
>>> from pyhive import hive
>>> conn = hive.Connection(host='192.168.12.5', port=10000, username='root', database='behavior_labels')
报错:thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'
经过在GitHub和Stack Overflow查看资料,推测是sasl不支持windows版本,则使用impala连接hive
2.使用impala连接hive
1)当python版本是2.7时
pip2 install impyla
pip2 install thrift==0.9.3
pip2 install thrift_sasl #失败的话先安装相关依赖包sudo yum install cyrus-sasl-devel
>>> from impala.dbapi import connect
>>> conn = connect(host='192.168.12.5', port=10000, auth_mechanism='PLAIN', user='root', password='*', database='behavior_labels')
>>> cur=conn.cursor()
>>> cur.execute('SHOW TABLES')
>>> cur.fetchall()
显示hive中的table表即成功
2)python版本是3.6时
安装前需把相关的包卸载干净,然后重新安装对应的版本
pip3 uninstall sasl #运行时报错module 'sasl' has no attribute 'Client',说明该包没有删除干净,需要手动删除文件
pip3 install impyla
pip3 install pure-sasl
pip3 install thrift_sasl==0.2.1 --no-deps
>>> from impala.dbapi import connect
运行报错:thriftpy.parser.exc.ThriftParserError: ThriftPy does not support generating module with path in protocol 'c' ,需要在文件"C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\thriftpy\parser\parser.py"中第488行代码
if url_scheme == '':
with open(path) as fh:
data = fh.read()
改为
if len(url_scheme) <=1:
with open(path) as fh:
data = fh.read()
>>> conn = connect(host='192.168.12.5', port=10000, auth_mechanism='PLAIN', user='root', password='5606603', database='behavior_labels')
报错:TypeError: can't concat str to bytes 需要在File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-p
ackages\thrift_sasl\__init__.py"第94行代码
def _send_message(self, status, body):
header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)
self._trans.flush()
改为
def _send_message(self, status, body):
header = struct.pack(">BI", status, len(body))
if(type(body) is str):
body = body.encode()
self._trans.write(header + body)
self._trans.flush()
到此基本就没问题了,打开python3,开始连接hive
>>> from impala.dbapi import connect
>>> conn = connect(host='192.168.12.5', port=10000, auth_mechanism='PLAIN', user='root', password='*', database='behavior_labels')
>>> cur=conn.cursor()
>>> cur.execute('SHOW TABLES')
>>> cur.fetchall()
参考资料:https://blog.csdn.net/Xiblade/article/details/82318294,https://www.jianshu.com/p/cb2b864b4aca