Apache CarbonData is a columnar storage file format for Apache Hadoop, developed and open-sourced by Huawei. It supports indexing, compression, and encoding/decoding, with the goal of serving multiple workloads from a single copy of the data while enabling faster interactive queries. The project is currently incubating at Apache; the latest release, v1.0.0-incubating, was published on 2017-01-29. This experiment and document target that release (Spark 2.1.0 & Hadoop 2.7.2).
1 Install Apache Thrift
1.1 Install Boost
Download Boost from boost.org, then build and install it:
./bootstrap.sh
sudo ./b2 threading=multi address-model=64 variant=release stage install
1.2 Install libevent
Download the libevent source, then build and install it:
./configure --prefix=/usr/local
make
sudo make install
1.3 Build Apache Thrift
Download the Apache Thrift source, configure it against the Boost and libevent installations above, then build and install:
./configure --prefix=/usr/local/ --with-boost=/usr/local --with-libevent=/usr/local
make
sudo make install
2 Build CarbonData
The following must be installed before building:
- A Unix-like system (Linux, Mac OS X)
- Git
- Apache Maven
- Java 7 or 8
- Scala 2.11
- Apache Thrift (installed in the previous step)
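Before building, it may help to confirm that each prerequisite is actually on the PATH. A minimal sketch (the names below are the standard command names; the script only reports, it does not install anything):

```shell
#!/bin/sh
# Report which build prerequisites are present; prints MISSING for absent ones.
for tool in git mvn java scala thrift; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: MISSING"
  fi
done
```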
2.1 Clone CarbonData
git clone https://github.com/apache/incubator-carbondata.git
2.2 Switch the Maven mirror
Switch the Maven mirror in ~/.m2/settings.xml as needed during the build. Two mirrors were configured in the experiment environment, and in a few cases individual artifacts had to be downloaded manually. Example:
<mirrors>
  <mirror>
    <id>repo2</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo2.maven.org/maven2/</url>
  </mirror>
</mirrors>
2.3 Build CarbonData
Build with the default profiles, or pick the command matching your Spark version. This experiment uses the last one (Spark 2.1.0, Hadoop 2.7.2):
mvn -DskipTests clean package
mvn -DskipTests -Pspark-1.5 -Dspark.version=1.5.0 clean package
mvn -DskipTests -Pspark-1.5 -Dspark.version=1.5.1 clean package
mvn -DskipTests -Pspark-1.5 -Dspark.version=1.5.2 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.0 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 -Phadoop-2.7.2 clean package
Build output from the last command:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 2.145 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [23:36 min]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [05:56 min]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 6.273 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 3.534 s]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [ 40.358 s]
[INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [ 55.588 s]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [ 28.990 s]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 6.670 s]
[INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [ 9.153 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 32:05 min
[INFO] Finished at: 2017-02-03T16:55:09+08:00
[INFO] Final Memory: 98M/895M
[INFO] ------------------------------------------------------------------------
3 Install and configure CarbonData on a standalone Spark cluster
Prerequisites:
- HDFS and YARN are installed and running
- Spark is installed and running
- The CarbonData user has permission to access HDFS
After the build in the previous section, copy ./assembly/target/scala-2.11/carbondata_xxx.jar into $SPARK_HOME/carbonlib (the carbonlib directory must be created manually; in this environment $SPARK_HOME points to $CARBONDATA/spark-2.1.0, hence the paths below):
mkdir $SPARK_HOME/carbonlib
cp ./assembly/target/scala-2.11/carbondata_2.11-1.0.0-incubating-shade-hadoop2.7.2.jar $CARBONDATA/spark-2.1.0/carbonlib/
cp $CARBONDATA/carbondata-parent-1.0.0-incubating/conf/carbon.properties.template $CARBONDATA/spark-2.1.0/conf/carbon.properties
cp -r $CARBONDATA/carbondata-parent-1.0.0-incubating/processing/carbonplugins $CARBONDATA/spark-2.1.0/carbonlib/
Append the following to $SPARK_HOME/conf/spark-env.sh so that the jars under carbonlib end up on the classpath:
SPARK_CLASSPATH=$SPARK_CLASSPATH:${SPARK_HOME}/carbonlib/*
Add these properties to $SPARK_HOME/conf/spark-defaults.conf:
carbon.kettle.home $SPARK_HOME/carbonlib/carbonplugins
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Finally, set the following in $SPARK_HOME/conf/carbon.properties:
carbon.store.location=hdfs://localhost:9000/opt/CarbonStore
carbon.ddl.base.hdfs.url=hdfs://localhost:9000/opt/data
carbon.kettle.home=$SPARK_HOME/carbonlib/carbonplugins
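The two HDFS paths referenced by carbon.store.location and carbon.ddl.base.hdfs.url must exist and be writable by the CarbonData user. A sketch, assuming the NameNode from the settings above (localhost:9000) and the hdfs CLI on the PATH:

```shell
# Create the store and data directories on HDFS and hand them to the
# current user (paths match the carbon.properties values above).
hdfs dfs -mkdir -p /opt/CarbonStore /opt/data
hdfs dfs -chown -R "$USER" /opt/CarbonStore /opt/data
hdfs dfs -ls /opt
```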
4 Calling CarbonData from the Spark shell
4.1 Prepare the test data file sample.csv
Contents:
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
Then upload sample.csv to HDFS.
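The upload can be done with the hdfs CLI. A sketch, assuming a running HDFS at localhost:9000 and an HDFS directory /resources matching the LOAD DATA path used in section 4.3:

```shell
# Write sample.csv locally, then push it to /resources on HDFS.
cat > sample.csv <<'EOF'
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
hdfs dfs -mkdir -p /resources
hdfs dfs -put -f sample.csv /resources/
```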
4.2 Set the CarbonData jar variable and start spark-shell
carbondata_jar=$SPARK_HOME/carbonlib/carbondata_2.11-1.0.0-incubating-shade-hadoop2.7.2.jar
spark-shell --master spark://localhost:7077 --jars ${carbondata_jar}
Note: at startup you may see Spark SQL errors or a failure to create the Hive client session; copying $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf/ fixes this:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
4.3 Verify that the CarbonData file format works
scala> import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.CarbonSession._
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:9000/opt/CarbonStore")
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")
scala> carbon.sql("LOAD DATA INPATH 'hdfs://localhost:9000/resources/sample.csv' INTO TABLE test_table")
scala> carbon.sql("SELECT * FROM test_table").show()
+---+-----+--------+---+
| id| name| city|age|
+---+-----+--------+---+
| 1|david|shenzhen| 31|
| 2|eason|shenzhen| 27|
| 3|jarry| wuhan| 35|
+---+-----+--------+---+
scala> carbon.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()
+--------+--------+--------+
| city|avg(age)|sum(age)|
+--------+--------+--------+
| wuhan| 35.0| 35|
|shenzhen| 29.0| 58|
+--------+--------+--------+
References
- Apache Thrift OS X Setup
- Quick Start
- Apache CarbonData Github
- Installation Guide