Assignment 1 Answer Key (Identifying Fraud-Risk Users with Algorithms)


The dataset for Assignment 1 is deliberately on the large side: its purpose is to test how well students can handle and model big data. Several students mentioned that models ran slowly on data of this size, and one suggested running the models in parallel, which is a very good idea for this situation. For the modeling below I use the fast machine-learning algorithms in the MicrosoftML package from Microsoft R Services, an enterprise-grade big-data machine-learning platform, in the hope of showing students how to build models in R when the data is large.

1. Data type conversion: since "gender" actually encodes male and female users, the "gender" variable must be converted to a factor, with level "F" replacing 1 and "M" replacing 2; the "fraudRisk" variable must also be converted to a factor.

Answer: First import the data into R and check its dimensions and structure:

> # Import the data
> ccFraud <- read.csv("ccFraud.csv")
> # Check the dimensions and structure
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : int 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : int 0 0 0 0 0 0 0 0 0 0 ...

The ccFraud dataset has ten million rows and 9 columns, all integer variables. As the problem requires, first convert the "gender" variable to a factor, with level "F" replacing 1 and "M" replacing 2:

> ccFraud$gender <- factor(ifelse(ccFraud$gender==1,'F','M'))
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : int 0 0 0 0 0 0 0 0 0 0 ...

Convert the "fraudRisk" variable to a factor as well:

> ccFraud$fraudRisk <- as.factor(ccFraud$fraudRisk)
> str(ccFraud)
'data.frame': 10000000 obs. of 9 variables:
$ custID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 2 1 1 2 1 ...
$ state : int 35 2 2 15 46 44 3 10 32 23 ...
$ cardholder : int 1 1 1 1 1 2 1 1 1 1 ...
$ balance : int 3000 0 0 0 0 5546 2000 6016 2428 0 ...
$ numTrans : int 4 9 27 12 11 21 41 20 4 18 ...
$ numIntlTrans: int 14 0 9 0 16 0 0 3 10 56 ...
$ creditLine : int 2 18 16 5 7 13 1 6 22 5 ...
$ fraudRisk : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
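Incidentally, the same gender recode can be done in one call with factor()'s levels and labels arguments instead of ifelse(); the toy vector g below stands in for ccFraud$gender:

```r
# One-step alternative to the ifelse() recode used above:
# factor() can map numeric codes to labels directly
g <- c(1, 2, 2, 1, 1, 2)  # toy stand-in for ccFraud$gender
gender <- factor(g, levels = c(1, 2), labels = c("F", "M"))
gender  # F M M F F M, with levels "F" "M"
```

Both approaches produce identical factors; the levels/labels form avoids scanning the vector twice and documents the mapping explicitly.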

2. Data exploration: examine the frequency and proportion of 0s and 1s in the "fraudRisk" variable.

Answer: This one is a giveaway; the table and prop.table functions do it directly:

> table(ccFraud$fraudRisk)

0 1
9403986 596014
> prop.table(table(ccFraud$fraudRisk))

0 1
0.9403986 0.0596014
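Given this imbalance, raw error rate deserves caution: a model that predicts "0" for every user is wrong only 5.96% of the time. A quick base-R check of that baseline, using the class counts copied from the table() output above:

```r
# Class counts from table(ccFraud$fraudRisk) above
counts <- c(`0` = 9403986, `1` = 596014)

# Proportions, matching prop.table(table(...)) above
props <- counts / sum(counts)

# A trivial "always predict 0" classifier misclassifies exactly the
# fraud cases, so its error rate equals the positive-class proportion
baseline_error <- unname(props["1"])
round(baseline_error, 7)  # 0.0596014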

3. Data partitioning: sample proportionally within each level of fraudRisk, putting 80% of the data in the training set train and 20% in the test set test.

Answer: Since the sample must be stratified by the fraudRisk variable, we use the createDataPartition function from the caret package:

> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> idx <- createDataPartition(ccFraud$fraudRisk,p=0.8,list=F)
> train <- ccFraud[idx,]
> test <- ccFraud[-idx,]
> prop.table(table(train$fraudRisk))
0 1
0.94039851 0.05960149
> prop.table(table(test$fraudRisk))
0 1
0.94039897 0.05960103
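If caret is unavailable, a stratified 80/20 split can be sketched in base R. The helper below is my own sketch, not caret's implementation: it simply samples 80% of the row indices within each class.

```r
# Minimal base-R stratified split (a sketch, not caret's exact algorithm):
# sample a fraction p of the indices within each level of y
stratified_idx <- function(y, p = 0.8, seed = 123) {
  set.seed(seed)
  unlist(lapply(split(seq_along(y), y), function(i) {
    sample(i, size = floor(length(i) * p))
  }), use.names = FALSE)
}

# Toy demonstration with the same 94/6 imbalance as fraudRisk
y <- factor(rep(c(0, 1), times = c(940, 60)))
idx <- stratified_idx(y, p = 0.8)
table(y[idx])  # 752 zeros and 48 ones -- class proportions preserved
```

The returned indices play the same role as idx above: rows in idx form the training set and the remainder form the test set.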

4. Model building: build predictive models using at least three common classification algorithms (e.g. k-nearest neighbors, decision trees, random forests).

Answer: Because the data is large and students reported slow runs, we build the models with the MicrosoftML package. For a quick start with MRS, see the earlier article: https://ask.hellobi.com/blog/xiejiabiao/8559

> # Model 1: build a fast decision-tree model with rxFastTrees() from the MicrosoftML package
> (a <- Sys.time()) # time before training
[1] "2017-09-03 23:32:04 CST"
> treeModel <- rxFastTrees(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Elapsed time: 00:01:04.6222538
> (b <- Sys.time()) # time after training
[1] "2017-09-03 23:33:09 CST"
> b-a # training duration
Time difference of 1.086313 mins
> # Model 2: build a fast random-forest model with rxFastForest() from the MicrosoftML package
> (a <- Sys.time()) # time before training
[1] "2017-09-03 23:33:31 CST"
> forestModel <- rxFastForest(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Processed 8000001 instances
Binning and forming Feature objects
Reserved memory for tree learner: 79664 bytes
Starting to train ...
Training calibrator.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:01:25.4585776
> (b <- Sys.time()) # time after training
[1] "2017-09-03 23:34:57 CST"
> b-a # training duration
Time difference of 1.433823 mins
> # Model 3: build a fast logistic-regression model with rxLogisticRegression() from the MicrosoftML package
> (a <- Sys.time()) # time before training
[1] "2017-09-03 23:34:57 CST"
> logitModel <- rxLogisticRegression(fraudRisk ~ gender + cardholder + balance + numTrans
+ + numIntlTrans + creditLine,data = train)
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Beginning optimization
num vars: 8
improvement criterion: Mean Improvement
L1 regularization selected 8 of 8 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:19.5887244
Elapsed time: 00:00:00.0383181
> (b <- Sys.time()) # time after training
[1] "2017-09-03 23:35:17 CST"
> b-a # training duration
Time difference of 20.27396 secs

Logistic regression trained fastest, taking 20.3 seconds; the decision tree took about 1.09 minutes, and the random forest was slowest at about 1.43 minutes.

5. Model evaluation: use the models built above (at least the three from step 4) to predict on both the training and test sets, evaluate their performance, and pick the best model as the business prediction model going forward. (Hint: build confusion matrices.)

Answer: For each of the three models above, we predict on the train and test sets and evaluate the results.

> # Predict with the decision-tree model and compute the error rate
> treePred_tr <- rxPredict(treeModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:52.1015119
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,treePred_tr$PredictedLabel)
> t

0 1
0 7446742 76447
1 253008 223804
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # decision-tree error rate on the train set
[1] "4.1%"
> treePred_te <- rxPredict(treeModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:13.4980323
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,treePred_te$PredictedLabel)
> t1

0 1
0 1861406 19391
1 63176 56026
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # decision-tree error rate on the test set
[1] "4.1%"
> # Predict with the random-forest model and compute the error rate
> forestPred_tr <- rxPredict(forestModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:56.2862657
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,forestPred_tr$PredictedLabel)
> t

0 1
0 7508808 14381
1 373777 103035
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # random-forest error rate on the train set
[1] "4.9%"
> forestPred_te <- rxPredict(forestModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:14.0430130
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,forestPred_te$PredictedLabel)
> t1

0 1
0 1877117 3680
1 93419 25783
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # random-forest error rate on the test set
[1] "4.9%"
> # Predict with the logistic-regression model and compute the error rate
> logitPred_tr <- rxPredict(logitModel,data = train)
Beginning processing data.
Rows Read: 8000001, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:08.1674394
Finished writing 8000001 rows.
Writing completed.
> t <- table(train$fraudRisk,logitPred_tr$PredictedLabel)
> t

0 1
0 7444156 79033
1 250679 226133
> (paste0(round((sum(t)-sum(diag(t)))/sum(t),3)*100,"%")) # logistic-regression error rate on the train set
[1] "4.1%"
> logitPred_te <- rxPredict(logitModel,data = test)
Beginning processing data.
Rows Read: 1999999, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:02.0736547
Finished writing 1999999 rows.
Writing completed.
> t1 <- table(test$fraudRisk,logitPred_te$PredictedLabel)
> t1

0 1
0 1860885 19912
1 62428 56774
> (paste0(round((sum(t1)-sum(diag(t1)))/sum(t1),3)*100,"%")) # logistic-regression error rate on the test set
[1] "4.1%"

Judging by the error rates on the training and test sets together with the run times, logistic regression is the best choice for this data: it matches the decision tree's error rate while training in a fraction of the time.
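The repeated expression (sum(t) - sum(diag(t)))/sum(t) is easy to get wrong when both t and t1 are in scope (in the original transcript, two of the test-set lines divided by the training total). Wrapping it in a helper keeps the denominator tied to the matrix being evaluated; the matrices below are copied from the confusion tables printed above:

```r
# Misclassification rate from a confusion matrix; using one argument
# for both numerator and denominator avoids the t-vs-t1 slip
err_rate <- function(cm) (sum(cm) - sum(diag(cm))) / sum(cm)

# Test-set confusion matrices as printed above (rows = actual, cols = predicted)
tree_te   <- matrix(c(1861406, 63176, 19391, 56026), nrow = 2)
forest_te <- matrix(c(1877117, 93419,  3680, 25783), nrow = 2)
logit_te  <- matrix(c(1860885, 62428, 19912, 56774), nrow = 2)

round(err_rate(tree_te), 3)    # 0.041
round(err_rate(forest_te), 3)  # 0.049
round(err_rate(logit_te), 3)   # 0.041

# On this imbalanced data, recall on the fraud class is more telling:
# the forest catches far fewer frauds despite a similar-looking error rate
56026 / (63176 + 56026)  # tree:   ~0.47
25783 / (93419 + 25783)  # forest: ~0.22
56774 / (62428 + 56774)  # logit:  ~0.48
```

By both error rate and fraud-class recall, logistic regression holds up as the best of the three here.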

As the training speeds above show, MRS is a great tool: it lets us apply machine-learning algorithms to big data and build and score models quickly. You can try downloading and installing it by following the quick-start guide.

This article was written by Xie Jiabiao and is licensed under a Creative Commons Attribution-ShareAlike 3.0 China Mainland License.
Please contact the author before reposting or quoting, credit the author, and cite the original source.

4 comments

hmx: Teacher Xie, is the data available?

Xie Jiabiao (replying to hmx): The data is the Assignment 1 dataset; it comes from the assignments in the fifteen-case-study course.

Reader: Do you have any methods for parameter tuning, something like ten-fold cross-validation?

Xie Jiabiao: I haven't studied this package in depth yet.