The Assignment 1 dataset is deliberately on the large side: the point is to test students' ability to process and model big data. Several students mentioned that their models ran slowly at this scale, and one suggested fitting models in parallel, which is a very good idea. For this assignment I build the models with the fast machine-learning algorithms in the MicrosoftML package, part of the enterprise-scale machine-learning platform in Microsoft R Server, in the hope that students will see how to do modeling work in R when the data are large.
1. Type conversion: because "gender" actually encodes male and female users, convert the "gender" variable to a factor, replacing 1 with level "F" and 2 with level "M"; convert the "fraudRisk" variable to a factor as well.
Answer: First import the data into R and check its dimensions and structure:
> # Import the data
> ccFraud <- read.csv("ccFraud.csv")
> # Check the structure
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : int 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : int 0 0 0 0 0 0 0 0 0 0 ...
The ccFraud dataset has ten million rows and 9 columns, all integer variables. As the task requires, first convert the "gender" variable to a factor, replacing 1 with level "F" and 2 with level "M". The code is as follows:
> ccFraud$gender <- factor(ifelse(ccFraud$gender==1,'F','M'))
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : int 0 0 0 0 0 0 0 0 0 0 ...
Convert the "fraudRisk" variable to a factor as well:
> ccFraud$fraudRisk <- as.factor(ccFraud$fraudRisk)
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
2. Data exploration: report the frequency and proportion of 0s and 1s in the "fraudRisk" variable.
Answer: This one is a giveaway: the table() and prop.table() functions do the job. The code is as follows:
> table(ccFraud$fraudRisk)
0 1
9403986 596014
> prop.table(table(ccFraud$fraudRisk))
0 1
0.9403986 0.0596014
3. Data partitioning: sample proportionally by the fraudRisk variable (stratified sampling), putting 80% into a training set train and 20% into a test set test.
Answer: Because the sampling must be stratified by fraudRisk, we use the createDataPartition() function from the caret package. The code is as follows:
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> idx <- createDataPartition(ccFraud$fraudRisk,p=0.8,list=F)
> train <- ccFraud[idx,]
> test <- ccFraud[-idx,]
> prop.table(table(train$fraudRisk))
0 1
0.94039851 0.05960149
> prop.table(table(test$fraudRisk))
0 1
0.94039897 0.05960103
4. Model building: fit predictive models with at least three common classification algorithms (e.g. k-nearest neighbors, decision trees, random forests).
Answer: Because the data are large and students reported slow run times, we fit the models with the MicrosoftML package. For a quick start with MRS, see the earlier post: https://ask.hellobi.com/blog/xiejiabiao/8559
> # Model 1: a fast decision-tree model with MicrosoftML's rxFastTrees()
> (a <- Sys.time()) # time before fitting
[1] "2017-09-03 23:32:04 CST"
> treeModel <- rxFastTrees(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Elapsed time: 00:01:04.6222538
> (b <- Sys.time()) # time after fitting
[1] "2017-09-03 23:33:09 CST"
> b-a # elapsed fitting time
Time difference of 1.086313 mins
> # Model 2: a fast random-forest model with MicrosoftML's rxFastForest()
> (a <- Sys.time()) # time before fitting
[1] "2017-09-03 23:33:31 CST"
> forestModel <- rxFastForest(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Training calibrator.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:01:25.4585776
> (b <- Sys.time()) # time after fitting
[1] "2017-09-03 23:34:57 CST"
> b-a # elapsed fitting time
Time difference of 1.433823 mins
> # Model 3: a fast logistic regression model with MicrosoftML's rxLogisticRegression()
> (a <- Sys.time()) # time before fitting
[1] "2017-09-03 23:34:57 CST"
> logitModel <- rxLogisticRegression(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Beginning optimization
num vars: 8
improvement criterion: Mean Improvement
L1 regularization selected 8 of 8 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:19.5887244
Elapsed time: 00:00:00.0383181
> (b <- Sys.time()) # time after fitting
[1] "2017-09-03 23:35:17 CST"
> b-a # elapsed fitting time
Time difference of 20.27396 secs
Logistic regression was fastest to fit at about 20.3 seconds, followed by the decision tree at about 1.09 minutes; the random forest took longest, at about 1.43 minutes.
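As an aside, the manual Sys.time() bookkeeping above can be folded into base R's system.time(). A minimal sketch, using Sys.sleep() as a stand-in for the model fit (in the session above the timed expression would be the rxFastTrees() call):

```r
# Sketch: system.time() replaces the manual before/after Sys.time() pair;
# the timed expression below is a stand-in for a model fit such as
# rxFastTrees(fraudRisk ~ ., data = train)
elapsed <- system.time(
  Sys.sleep(1)  # stand-in for the actual model-fitting call
)["elapsed"]
cat("fit took", round(elapsed, 1), "seconds\n")
```

system.time() returns user, system, and elapsed (wall-clock) times, so the elapsed component corresponds to the b - a differences reported above.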
5. Model evaluation: use the models built above (at least the three from step 4) to predict on the training and test sets, evaluate the results, and choose the best model as the future production prediction model. (Hint: build confusion matrices.)
Answer: For each of the three models, we predict on the train and test sets and evaluate the results.
> # Predict with the decision-tree model and compute its error rate
> treePred_tr <- rxPredict(treeModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:52.1015119
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,treePred_tr$PredictedLabel)
> t
0 1
0 7446742 76447
1 253008 223804
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # error rate of the decision tree on the train set
[1] "4.1%"
> treePred_te <- rxPredict(treeModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:13.4980323
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,treePred_te$PredictedLabel)
> t1
0 1
0 1861406 19391
1 63176 56026
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # error rate of the decision tree on the test set
[1] "4.1%"
> # Predict with the random-forest model and compute its error rate
> forestPred_tr <- rxPredict(forestModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:56.2862657
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,forestPred_tr$PredictedLabel)
> t
0 1
0 7508808 14381
1 373777 103035
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # error rate of the random forest on the train set
[1] "4.9%"
> forestPred_te <- rxPredict(forestModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:14.0430130
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,forestPred_te$PredictedLabel)
> t1
0 1
0 1877117 3680
1 93419 25783
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # error rate of the random forest on the test set
[1] "4.9%"
> # Predict with the logistic regression model and compute its error rate
> logitPred_tr <- rxPredict(logitModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:08.1674394
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,logitPred_tr$PredictedLabel)
> t
0 1
0 7444156 79033
1 250679 226133
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # error rate of the logistic regression on the train set
[1] "4.1%"
> logitPred_te <- rxPredict(logitModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:02.0736547
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,logitPred_te$PredictedLabel)
> t1
0 1
0 1860885 19912
1 62428 56774
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # error rate of the logistic regression on the test set
[1] "4.1%"
Judging by the prediction error rates on the training and test sets together with fitting time, logistic regression is the best choice for this data.
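The inline error-rate expression above is repeated six times, and its denominator is easy to mistype. A small base-R helper avoids that; err_rate is a name invented here for illustration:

```r
# Hypothetical helper: misclassification rate from actual vs. predicted
# labels, formatted the same way as the inline expressions above
err_rate <- function(actual, predicted) {
  t <- table(actual, predicted)  # confusion matrix
  paste0(round((sum(t) - sum(diag(t))) / sum(t), 3) * 100, "%")
}

# Example usage against the objects from the session above:
# err_rate(test$fraudRisk, logitPred_te$PredictedLabel)
```

Because both the confusion matrix and its total come from the same table() call inside the function, the train/test denominators can no longer be mixed up.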
As the fitting speeds above show, MRS is a great tool: it lets us apply machine-learning algorithms to big data and build and score models quickly. You can download and install it by following the quick-start guide linked above.