Spark 2.0: Important Updates and Improvements

Published: 2016-07-29

Data Science, Machine Learning, Spark, Python

01 Introduction

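Just a few days ago (July 26, 2016), the official Spark 2.0 release came out. Let's take a look at what is arguably the most powerful full-stack data processing framework around today!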

As a data scientist, if you could only learn one framework in your whole life, start with Spark!

In addition, this release includes over 2500 patches from over 300 contributors.

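More than 2,500 patches from more than 300 contributors!
Look at that: this is collective intelligence at work. It is a level that any single company or team would probably struggle to reach in just a few months.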

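This post is a rough digest of the official release notes. I picked the items I personally consider most important and translated and commented on them. For the complete official release notes, see: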
https://spark.apache.org/releases/spark-release-2-0-0.html

The default build is now using Scala 2.11 rather than Scala 2.10
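The default Scala version used to build Spark has moved from 2.10 to 2.11. This suggests that from now on your own Scala programs are best compiled against 2.11 as well.
[Deprecation] Support for Java 7, Support for Python 2.6
Java 7 and Python 2.6 are now deprecated.
Also, Spark's Python 3 support is already quite good. If you use PySpark, going straight to Python 3 will save you some trouble.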
Spark 2.0 no longer requires a fat assembly jar for production deployment.
Production deployments no longer need that bloated assembly jar (this looks like a benefit mainly for Scala developers).

Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
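In Scala and Java, the DataFrame and Dataset data structures have been unified. In Python and R, because the languages themselves lack type safety, DataFrame remains the main programming interface.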
SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
SparkSession becomes the new entry point, unifying the old SQLContext and HiveContext. Both of those remain available for backward compatibility.
Native CSV data source, based on Databricks’ spark-csv module
CSV is now officially supported as a data source (much more convenient: no more parsing lines by hand with split, as before).

A native SQL parser that supports both ANSI-SQL as well as Hive QL
A native SQL parser that supports both ANSI SQL (the release notes cite SQL:2003 support) and Hive QL.
Substantial (2 - 10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole stage code generation
A new technique called "whole-stage code generation" is reported to speed up common SQL and DataFrame operators by 2-10x.
Improved ORC performance
Performance of the ORC storage format has been improved. This is also the data format that HDP 2 strongly backs.
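Uncorrelated Scalar Subqueries, Correlated Scalar Subqueries
Correlated or uncorrelated scalar subqueries (a subquery can be written directly wherever a scalar value appears in the SELECT list).
IN/NOT IN/EXISTS/NOT EXISTS predicate subqueries (in WHERE/HAVING clauses)
In WHERE and HAVING clauses you can write predicate subqueries using IN / NOT IN / EXISTS / NOT EXISTS.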
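To make the subquery forms above concrete, here is a small sketch of what each one looks like. It runs the statements with Python's built-in `sqlite3` module as a stand-in, purely so the example needs no cluster; this is not Spark itself. The assumption is that the same standard SQL text could be passed to `spark.sql(...)` in Spark 2.0 once the tables are registered as views. The `dept` and `emp` tables and their contents are hypothetical.

```python
import sqlite3

# Stand-in engine (NOT Spark): used only to show the shape of each subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'empty');
    INSERT INTO emp VALUES ('ann', 1, 100), ('bob', 1, 80), ('cat', 2, 90);
""")

# Uncorrelated scalar subquery: the inner query does not reference the outer row.
uncorrelated = conn.execute("""
    SELECT name, salary - (SELECT AVG(salary) FROM emp) AS diff
    FROM emp ORDER BY name
""").fetchall()

# Correlated scalar subquery: the inner query refers to the outer row (d.id).
correlated = conn.execute("""
    SELECT d.name,
           (SELECT MAX(e.salary) FROM emp e WHERE e.dept_id = d.id) AS top_salary
    FROM dept d ORDER BY d.id
""").fetchall()

# Predicate subqueries in the WHERE clause: IN and NOT EXISTS.
in_eng = conn.execute("""
    SELECT name FROM emp
    WHERE dept_id IN (SELECT id FROM dept WHERE name = 'eng')
    ORDER BY name
""").fetchall()

no_staff = conn.execute("""
    SELECT d.name FROM dept d
    WHERE NOT EXISTS (SELECT 1 FROM emp e WHERE e.dept_id = d.id)
""").fetchall()

conn.close()
```

The correlated scalar subquery uses an aggregate (`MAX`) so that it yields at most one value per outer row, which is what makes it usable in a scalar position.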

The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode.
For machine learning, the DataFrame-based API becomes the primary API, and the RDD-based API enters maintenance mode.
ML persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R
ML model persistence: the DataFrame-based API offers near-complete support for saving and loading models and Pipelines.
Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.
PySpark now offers more algorithms, such as LDA (topic modeling), Gaussian mixture models, and generalized linear regression.

New features in Spark 2.0:
http://www.iteblog.com/archives/1721

