Random Forests for Classification and Regression: Spark Examples

Published: 2017-04-01


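Following on from the previous post, this one looks at random forests.

A random forest is an ensemble of decision trees. It is one of the more successful machine learning algorithms and works for both classification and regression. By combining many decision trees, a random forest lowers the risk of overfitting. It can capture non-linear relationships and can also learn interactions between features.

Random forests in spark.mllib support binary classification, multiclass classification, and regression, with both continuous and categorical features. The spark.mllib implementation is built on top of its existing decision tree implementation.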

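Basic algorithm

Each decision tree in a random forest is trained independently, so training can run in parallel. The algorithm injects randomness into training, so each tree turns out slightly different. Combining the predictions of all the trees reduces the variance of the prediction and improves results on the test set.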

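Training

The randomness injected during training includes:

For each decision tree, the original dataset is resampled on each iteration, so every tree sees a different training set (the well-known bootstrapping)
At each node of a tree, a different random subset of features is considered for the split

Apart from this randomness, each tree is trained the same way as an ordinary decision tree.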
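The two sources of randomness above can be sketched in plain Scala. This is a minimal illustration and not how spark.mllib implements them internally; the object name ForestRandomness and both function names are made up for this example.

```scala
import scala.util.Random

// Hypothetical helper names, for illustration only.
object ForestRandomness {

  /** Draws data.size rows with replacement: a bootstrap sample. */
  def bootstrapSample[T](data: IndexedSeq[T], rng: Random): IndexedSeq[T] =
    IndexedSeq.fill(data.size)(data(rng.nextInt(data.size)))

  /** Picks k distinct feature indices out of numFeatures, uniformly at random. */
  def randomFeatureSubset(numFeatures: Int, k: Int, rng: Random): Seq[Int] =
    rng.shuffle((0 until numFeatures).toList).take(k).sorted
}

object ForestRandomnessDemo extends App {
  val rng = new Random(42)
  val rows = Vector("a", "b", "c", "d", "e")

  // Some rows will typically appear more than once and others not at all.
  println(ForestRandomness.bootstrapSample(rows, rng))

  // Candidate features for one node; a fresh subset is drawn per node.
  println(ForestRandomness.randomFeatureSubset(numFeatures = 10, k = 3, rng = rng))
}
```

Because each tree gets its own bootstrap sample and each node its own feature subset, no two trees are trained on exactly the same view of the data.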

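Prediction

To predict a new sample, a random forest has to aggregate the predictions of all its decision trees. The aggregation strategy differs between classification and regression.

Classification: majority vote. Each tree's prediction counts as one vote for a class, and the sample is assigned to the class with the most votes.

Regression: averaging. Each tree outputs a real value, and the final prediction is the mean of the trees' predictions.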
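Both aggregation rules fit in a few lines of plain Scala. The sketch below only illustrates the idea; ForestAggregation and its methods are names invented here, not part of the spark.mllib API.

```scala
// Hypothetical helper names, for illustration only.
object ForestAggregation {

  /** Classification: returns the class label that most trees voted for.
    * If several classes tie, which one wins is arbitrary in this sketch. */
  def majorityVote(treePredictions: Seq[Double]): Double = {
    require(treePredictions.nonEmpty, "need at least one tree prediction")
    treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1
  }

  /** Regression: returns the mean of the per-tree predictions. */
  def average(treePredictions: Seq[Double]): Double = {
    require(treePredictions.nonEmpty, "need at least one tree prediction")
    treePredictions.sum / treePredictions.size
  }
}

object ForestAggregationDemo extends App {
  // Three trees vote: two for class 1.0, one for class 0.0.
  println(ForestAggregation.majorityVote(Seq(1.0, 0.0, 1.0)))

  // Three trees predict real values; the forest reports their mean.
  println(ForestAggregation.average(Seq(1.0, 2.0, 3.0)))
}
```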

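Usage tips

Several parameters deserve attention when using random forests.

The first two are the most important, and tuning them can often improve performance:

numTrees: the number of decision trees in the forest

Adding trees lowers the variance of the predictions and improves test-time accuracy
Training time grows roughly linearly with the number of trees

maxDepth: the maximum depth of each tree in the forest

Deeper trees make the model more expressive, but they take longer to train and are more prone to overfitting
In general, it is acceptable to train deeper trees in a random forest than you would with a single decision tree. A single tree overfits more easily than a forest, because averaging over many trees reduces variance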
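The variance-reduction argument behind both parameters can be made concrete with a standard identity. The notation is introduced here and is not from the original text: assume the forest averages $B$ trees whose predictions $T_b(x)$ each have variance $\sigma^2$ and share a pairwise correlation $\rho$. Under those assumptions:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as numTrees ($B$) grows, which is why adding trees helps. The first term does not depend on $B$; the randomness injected during training is what keeps $\rho$, and therefore that term, small.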

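The next two parameters usually do not need tuning, but adjusting them can speed up training.

subsamplingRate: the size of the sample used to train each tree, expressed as a fraction of the full dataset. The default of 1.0 is recommended; lowering it can speed up training.
featureSubsetStrategy: the number of features considered as split candidates at each tree node, specified as a function of the total number of features or as a fraction of it. Lowering this number speeds up training, but setting it too low can sometimes hurt performance.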
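To show how a strategy name turns into a feature count, here is a small sketch. The "auto" resolution follows the Spark documentation (all features for a single tree, otherwise sqrt for classification and one third for regression). The rounding is an assumption of this sketch, and FeatureSubset.numFeaturesPerNode is a made-up helper, not a Spark function.

```scala
// Hypothetical helper, for illustration only; rounding choices are assumptions.
object FeatureSubset {

  def numFeaturesPerNode(strategy: String,
                         numFeatures: Int,
                         numTrees: Int,
                         isClassification: Boolean): Int = {
    require(numFeatures > 0, "numFeatures must be positive")

    // "auto" lets the algorithm choose based on the forest size and the task.
    val resolved = strategy match {
      case "auto" =>
        if (numTrees == 1) "all"
        else if (isClassification) "sqrt"
        else "onethird"
      case other => other
    }

    resolved match {
      case "all"      => numFeatures
      case "sqrt"     => math.ceil(math.sqrt(numFeatures.toDouble)).toInt
      // Integer-exact ceil(log2(n)), floored at 1, to avoid floating-point edge cases.
      case "log2"     => math.max(1, 32 - Integer.numberOfLeadingZeros(numFeatures - 1))
      case "onethird" => math.ceil(numFeatures / 3.0).toInt
      case unknown    =>
        throw new IllegalArgumentException(s"Unsupported featureSubsetStrategy: $unknown")
    }
  }
}
```

For example, FeatureSubset.numFeaturesPerNode("auto", numFeatures = 100, numTrees = 10, isClassification = true) resolves to the sqrt rule, so each node considers 10 of the 100 features.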

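Examples

Classification

The example below loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and then classifies with a random forest. The test error is computed to measure the accuracy of the algorithm.

The Scala example follows.

For API details, see the RandomForest Scala docs and the RandomForestModel Scala docs.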

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Assumes `sc` is an existing SparkContext (for example, in spark-shell).
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)

// Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

The full example code is in "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala" in the Spark repo.
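For readers who have not met the LIBSVM format used by the input file: each line is a label followed by index:value pairs, with one-based indices, and MLUtils.loadLibSVMFile converts the indices to zero-based. The sketch below parses a single line in plain Scala to show the layout. LibSvmLine.parse is a hypothetical helper written for this post, not the parser Spark uses, and it assumes a well-formed, non-empty line.

```scala
// Hypothetical helper, for illustration only.
object LibSvmLine {

  /** Parses "label idx:value idx:value ..." into
    * (label, zero-based feature indices, feature values). */
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { item =>
      val parts = item.split(":")
      // LIBSVM indices are one-based; shift them to zero-based.
      (parts(0).toInt - 1, parts(1).toDouble)
    }.unzip
    (label, indices, values)
  }
}

object LibSvmLineDemo extends App {
  val (label, indices, values) = LibSvmLine.parse("1 3:0.5 10:2.0")
  println(s"label=$label indices=${indices.mkString(",")} values=${values.mkString(",")}")
}
```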

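Regression

This example loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and then performs regression with a random forest. The Mean Squared Error (MSE) is computed to measure the goodness of fit.

The Scala example follows.

For API details, see the RandomForest Scala docs and the RandomForestModel Scala docs.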

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Assumes `sc` is an existing SparkContext (for example, in spark-shell).
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
// Regression takes no numClasses argument.
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)

// Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")

The full example code is in "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala" in the Spark repo.
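The two evaluation metrics used in the examples are easy to check by hand on a small local collection. The sketch below mirrors the RDD computations with ordinary Scala sequences; ForestMetrics is a name made up for this illustration, and the (label, prediction) pairs in the demo are invented values.

```scala
// Hypothetical helper names, for illustration only.
object ForestMetrics {

  /** Fraction of (label, prediction) pairs where the prediction is wrong. */
  def testError(labelAndPreds: Seq[(Double, Double)]): Double = {
    require(labelAndPreds.nonEmpty, "need at least one pair")
    labelAndPreds.count { case (label, pred) => label != pred }.toDouble / labelAndPreds.size
  }

  /** Mean of the squared differences between label and prediction. */
  def meanSquaredError(labelAndPreds: Seq[(Double, Double)]): Double = {
    require(labelAndPreds.nonEmpty, "need at least one pair")
    labelAndPreds.map { case (label, pred) => math.pow(label - pred, 2) }.sum / labelAndPreds.size
  }
}

object ForestMetricsDemo extends App {
  // One of four classifications is wrong.
  println(ForestMetrics.testError(Seq((1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0))))

  // Squared errors are 1 and 4.
  println(ForestMetrics.meanSquaredError(Seq((1.0, 2.0), (3.0, 1.0))))
}
```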

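To wrap up, here is a summary of the strengths and weaknesses of random forests.

Advantages:

1. The trained model indicates which features matter most

2. Strong generalization

3. Fast to train and easy to parallelize

4. Learns interactions between features during training

5. Relatively simple to implement

6. Handles unbalanced datasets reasonably well

7. Robust to missing features

Disadvantages:

1. Prone to overfitting when the data is noisy

2. Attributes with more distinct split values carry more influence, so the attribute weights are not reliable in that case

Last of all, here is the final figure from the previous post, for comparison. As a preview, the next post will look at SVD++.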

References

http://spark.apache.org/docs/latest/mllib-ensembles.html
