GBDT 用于分类和回归的spark示例

发表: 2017-03-31 浏览: 5939

深度学习

GBDT是推荐系统中应用非常广泛的算法。

GBDT 是由决策树集成而来的，这种算法不断地迭代式训练决策树算法，目标是最小化损失函数。跟决策树类似，GBDT能够捕捉到非线性特征，也能发掘特征交互作用。

spark.mllib 支持GBDT的二分类问题和回归问题，spark.mllib中的GBDT是基于决策树来实现的，能够处理连续性和离散性的特征。

基本算法

GBDT 迭代式训练一系列决策树，每次迭代中，算法利用当前的集成算法来预测每个训练样本的标签，然后跟真实的标签相比较。为了重点关注预测效果较差的训练样本，数据集会被重打标签。所以，在下次迭代中，决策树能够有助于纠正之前的分错样本相应的预测。

重打标签的具体机制是由误差函数定义的。每次迭代中，GBDT 都能够减小训练集上的误差函数。

误差函数

下表给出了spark.mllib中GBDT支持的目标函数（损失函数），需要注意的是，每个目标函数要么适用于分类问题，要么适用于回归问题，并不是都适用的。

使用建议

下面给出了几个使用GBDT的建议，

loss: 目标函数。不同的目标函数适用不同的问题，比如分类或回归。不同的误差函数可能给出差别很大的最终结果，这依赖于数据集。
numIterations: 迭代次数。这个参数设置了集成算法中决策树的个数。每次迭代都会生成一棵决策树，增加迭代次数可以增加模型的表达能力，提高训练集上的准确率，然而，如果迭代次数过多，测试集上的准确率可能会降低。
learningRate: 学习率。这个参数不需要调节，如果算法不稳定，减小学习率可能会提高稳定性。
algo: 算法。算法或任务（分类或回归）。

训练时的验证

GBDT 中树的个数较多时，可能会过拟合。为了防止过拟合，通常需要训练时加以验证。GBDT 中提供的的 runWithValidation 方法可以用来验证效果。需要一对 RDD 作为参数，第一个是训练集，第二个是验证集。

训练结束的条件是验证误差的提升量不超过某个限定值，这个限定值即为BoostingStrategy中的validationTol。实际问题中，验证误差会先减小后增大。验证误差不是单调变化的，用户可以设置一个足够大的负限定值，然后利用 evaluationEachIteration 来检测验证曲线，进而调节训练次数，其中evalutionnEachIteration 表示每次迭代中的误差或损失。

示例

分类问题

示例中给出了如何加载 LIBSVM data file，然后将其解析成LabeledPoint RDD，然后利用GBDT来分类，其中目标函数是对数误差，测试误差用来衡量算法的准确率。

API细节可以参考 GradientBoostedTrees docs 和 GradientBoostedTreesModel docs 。.

Scala 代码如下

import org.apache.spark.mllib.tree.GradientBoostedTrees

import org.apache.spark.mllib.tree.configuration.BoostingStrategy

import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.

// The defaultParams for Classification use LogLoss by default.

val boostingStrategy = BoostingStrategy.defaultParams("Classification")

boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.

boostingStrategy.treeStrategy.numClasses = 2

boostingStrategy.treeStrategy.maxDepth = 5

// Empty categoricalFeaturesInfo indicates all features are continuous.

boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error

val labelAndPreds = testData.map { point =>

  val prediction = model.predict(point.features)

  (point.label, prediction)}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()

println("Test Error = " + testErr)

println("Learned classification GBT model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myGradientBoostingClassificationModel")

val sameModel = GradientBoostedTreesModel.load(sc,

  "target/tmp/myGradientBoostingClassificationModel")

完整代码可以参考spark repo中的"examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala" 。

python 代码如下

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel

from pyspark.mllib.util import MLUtils

# Load and parse the data file.

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

# Split the data into training and test sets (30% held out for testing)

(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.

#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.

#         (b) Use more iterations in practice.

model = GradientBoostedTrees.trainClassifier(trainingData,

                                             categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error

predictions = model.predict(testData.map(lambda x: x.features))l
abelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

print('Test Error = ' + str(testErr))

print('Learned classification GBT model:')

print(model.toDebugString())

# Save and load model

model.save(sc, "target/tmp/myGradientBoostingClassificationModel")

sameModel = GradientBoostedTreesModel.load(sc,

                                           "target/tmp/myGradientBoostingClassificationModel")

完整代码可以参考spark repo中的 "examples/src/main/python/mllib/gradient_boosting_classification_example.py" 。

完整代码

回归问题

下面给出了如何加载，将其解析成 LabeledPoint RDD，然后用GBDT来实现回归，其中目标函数是均方误差，均方误差用来衡量拟合的好坏。

API细节可以参考 GradientBoostedTrees docs 和 GradientBoostedTreesModel docs

Scala 代码如下

import org.apache.spark.mllib.tree.GradientBoostedTrees

import org.apache.spark.mllib.tree.configuration.BoostingStrategy

import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.

// The defaultParams for Regression use SquaredError by default.

val boostingStrategy = BoostingStrategy.defaultParams("Regression")

boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.

boostingStrategy.treeStrategy.maxDepth = 5

// Empty categoricalFeaturesInfo indicates all features are continuous.

boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error

val labelsAndPredictions = testData.map { point =>

  val prediction = model.predict(point.features)

  (point.label, prediction)}

val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()

println("Test Mean Squared Error = " + testMSE)

println("Learned regression GBT model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myGradientBoostingRegressionModel")

val sameModel = GradientBoostedTreesModel.load(sc,

  "target/tmp/myGradientBoostingRegressionModel")

完整代码参加spark repo中的 "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala" 。

python 代码如下：

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel

from pyspark.mllib.util import MLUtils# Load and parse the data file.

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

# Split the data into training and test sets (30% held out for testing)

(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.

#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.

#         (b) Use more iterations in practice.

model = GradientBoostedTrees.trainRegressor(trainingData,

                                            categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error

predictions = model.predict(testData.map(lambda x: x.features))

labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() /\    float(testData.count())

print('Test Mean Squared Error = ' + str(testMSE))

print('Learned regression GBT model:')

print(model.toDebugString())

# Save and load model

model.save(sc, "target/tmp/myGradientBoostingRegressionModel")

sameModel = GradientBoostedTreesModel.load(sc, "target/tmp/myGradientBoostingRegressionModel")