Exploring Data Science (A Comparison of Python and R)

Published: 2017-05-23

Data analysis, machine learning, R, Python

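This installment uses two tools, Python and R, to analyze a real credit data set. By walking through reading, cleaning, exploring, and modeling the data, it compares how the two implement machine learning and data science tasks. The Python part draws on Kunal Jain's article on Analytics Vidhya, "A Complete Tutorial to Learn Data Science with Python from Scratch" (link to the original below).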

1 Exploring Data Science (A Comparison of Python and R)

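This is my first attempt at running both Python and R in Jupyter Notebook. For R, compatibility still seems to lag somewhat behind R Markdown, though that may just be my unfamiliarity with it. For how to run R in Jupyter, see [Jupyter and conda for R](http://www.tuicool.com/articles/nuaiEnF).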

We compare the Python and R implementations along the main steps of a data science workflow.

1.1 Reading the data

1.1.1 Python

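To draw plots in Jupyter Notebook with Python, put "%matplotlib inline" at the top of the code. Then load the required libraries, mainly numpy, pandas, and matplotlib. For routine operations, Python and R can be compared simply by reading the code side by side.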

%matplotlib inline

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

df = pd.read_csv("C:/Users/HP/Desktop/train.csv")

df.head()

df.describe()
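
Since train.csv sits on the author's desktop, the loading step can also be tried without the file. The following is a minimal sketch, assuming a tiny made-up sample: only the column names come from this post, while the three rows and the names `sample_csv` and `sample` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row sample; only the column names come from the post.
sample_csv = io.StringIO(
    "Loan_ID,Gender,ApplicantIncome,LoanAmount,Loan_Status\n"
    "X001,Male,5000,120,Y\n"
    "X002,Female,3000,,N\n"
    "X003,Male,4000,100,Y\n"
)

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(sample_csv)

print(sample.head())       # first rows, as with df.head()
print(sample.describe())   # summary statistics for the numeric columns only
```

Note that `describe()` skips the text columns by default, and the empty LoanAmount cell is read as a missing value.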

1.1.2 R

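To run R in Jupyter, besides installing the R kernel, I had to set a CRAN mirror before downloading and loading packages (at least with my version; pointers from experts are welcome). The describe function in the psych package gives output similar to Python's.

# Set the CRAN mirror
local({
    r <- getOption("repos")
    r["CRAN"] <- "http://mirrors.xmu.edu.cn/CRAN/"
    options(repos = r)
})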

library(ggplot2)
library(dplyr)
library(psych)
library(Hmisc)

train <- read.csv("C:/Users/HP/Desktop/train.csv")

head(train)

data.frame(psych::describe(train))

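1.2 Data description

This step plots and describes the distributions of the continuous and categorical variables. Python relies mainly on the matplotlib library and R on the ggplot2 package. Judging by the descriptions and plots, drawing ordinary charts is relatively simple in Python, but I find the layered grammar of ggplot in R easier to understand and work with.

1.2.1 Python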

# Distribution of applicants by property area
df['Property_Area'].value_counts()

Semiurban    233

Urban        202

Rural        179

Name: Property_Area, dtype: int64

# Distribution of applicant income
df['ApplicantIncome'].hist(bins=50)

# Applicant income grouped by education
df.boxplot(column='ApplicantIncome', by = 'Education')

# Distribution of applicants' credit history, with the approval rate per class
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],aggfunc=lambda x: x.map({'Y':1,"N":0}).mean())
print('Frequency Table for Credit History:')
print(temp1)

print('\nProbability of getting loan for each Credit History class:')
print(temp2)

Frequency Table for Credit History:

0.0     89

1.0    475

Name: Credit_History, dtype: int64

Probability of getting loan for each Credit History class:

Credit_History

0.0    0.078652

1.0    0.795789

Name: Loan_Status, dtype: float64

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)   # draw on the left subplot explicitly
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title('Applicants by Credit_History')

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)   # draw on the right subplot explicitly
ax2.set_xlabel('credit_history')
ax2.set_ylabel('probability of getting loan')
ax2.set_title('probability of getting loan by credit history')

# Cross-tabulate credit history against loan status and draw a stacked bar chart
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])

temp3.plot(kind='bar', stacked=True, color=['red','blue'],grid=False)
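
The pivot table with a lambda above computes the approval rate per credit-history class. One possible alternative is `pd.crosstab` with `normalize='index'`. This is only a sketch on a small made-up frame: the six rows and the names `toy`, `counts`, and `rates` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the post's column names.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "Y"],
})

# Raw counts per (Credit_History, Loan_Status) cell.
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])

# Row-normalized version: each row sums to 1, so column "Y" is the approval rate.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

print(counts)
print(rates["Y"])
```

The same `counts` table is what the stacked bar chart is drawn from, so both views come from one call pattern.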

1.2.2 R

# Distribution of applicants by property area
table(train$Property_Area)

Rural Semiurban     Urban 

  179       233       202

# Distribution of applicant income
ggplot(data=train, aes(x=ApplicantIncome)) + geom_histogram(bins = 50, fill = "blue")

# Applicant income grouped by education
ggplot(data=train, aes(x=Education, y=ApplicantIncome)) + geom_boxplot()

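1.3 Data processing

Data processing mainly covers the following:

Handling missing values
Handling extreme values

Only the simplest techniques are used here: missing values are replaced with the mean or the mode, and the only transformation is a simple log transform. In my view, the pipe-based workflow of R's dplyr package has the edge over Python for data processing.

1.3.1 Python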

# Count missing values per column
df.apply(lambda x: sum(x.isnull()), axis = 0)

Loan_ID               0

Gender               13

Married               3

Dependents           15

Education             0

Self_Employed        32

ApplicantIncome       0

CoapplicantIncome     0

LoanAmount           22

Loan_Amount_Term     14

Credit_History       50

Property_Area         0

Loan_Status           0

dtype: int64

# Fill continuous variables with the mean and categorical variables with the mode
df['LoanAmount'].fillna(df['LoanAmount'].mean(),inplace=True)

# Check the distributions
df['Self_Employed'].value_counts()
df['Credit_History'].value_counts()
df['Gender'].value_counts() 
df['Loan_Amount_Term'].value_counts()
df['Dependents'].value_counts()
df['Married'].value_counts()

Yes    398

No     213

Name: Married, dtype: int64

df['Loan_Amount_Term'].fillna(360.0,inplace=True)

df['Gender'].fillna(2,inplace=True)

df['Credit_History'].fillna(1.0,inplace=True)
df['Dependents'].fillna(0,inplace=True)

table = df.pivot_table(values='LoanAmount',index='Self_Employed',columns='Education',aggfunc=np.median)
print(table)

Education      Graduate  Not Graduate

Self_Employed                        

NO                126.5         123.0

No                132.0         115.0

Yes               152.0         130.0

df['Self_Employed'].fillna('NO',inplace=True)
df['Married'].fillna('YES',inplace=True)

# Finally, check the result of the imputation
df.apply(lambda x: sum(x.isnull()), axis = 0)

Loan_ID              0

Gender               0

Married              0

Dependents           0

Education            0

Self_Employed        0

ApplicantIncome      0

CoapplicantIncome    0

LoanAmount           0

Loan_Amount_Term     0

Credit_History       0

Property_Area        0

Loan_Status          0

dtype: int64

# Data transformation
# Log-transform the loan amount
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)

# Combine both incomes and log-transform the total
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
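
The imputation and transformation steps can be condensed into a few lines. This is a minimal sketch on a made-up three-row frame: the values and the names `toy` and `most_common` are hypothetical, and assigning the result of `fillna` back to the column is just one possible style, not the form used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names follow the post.
toy = pd.DataFrame({
    "LoanAmount":    [100.0, np.nan, 200.0],
    "Self_Employed": ["No", None, "No"],
})

# Numeric column: fill with the mean of the observed values (150.0 here).
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

# Categorical column: fill with the most frequent value, taken from the data itself.
most_common = toy["Self_Employed"].mode()[0]
toy["Self_Employed"] = toy["Self_Employed"].fillna(most_common)

# Log transform to dampen the effect of large values.
toy["LoanAmount_log"] = np.log(toy["LoanAmount"])

print(toy)
```

Reading the fill value from `mode()` keeps its spelling identical to an existing category, so no extra level appears in later pivot tables.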

1.3.2 R

# Percentage of missing values per column
pMiss <- function(x) {
    sum(is.na(x))/length(x)*100
}
apply(train, 2, pMiss)

      Loan_ID            Gender           Married        Dependents         Education 

     0.000000          0.000000          0.000000          0.000000          0.000000 

Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount  Loan_Amount_Term 

     0.000000          0.000000          0.000000          3.583062          2.280130

library(Hmisc)
train$LoanAmount <- Hmisc::impute(train$LoanAmount, mean)

train$Loan_Amount_Term <- impute(train$Loan_Amount_Term, 360.0)
train$Credit_History <- impute(train$Credit_History, 1.0)
train$Self_Employed <- impute(train$Self_Employed, "NO")
train$Married <- impute(train$Married, "YES")
train$Gender <- impute(train$Gender, 2)

# Data transformation
# Log-transform the loan amount and the total income
library(dplyr)
train = mutate(train, LoanAmount_log = log(LoanAmount), TotalIncome = ApplicantIncome + CoapplicantIncome,TotalIncome_log =log(TotalIncome))
apply(train, 2, pMiss)

      Loan_ID            Gender           Married        Dependents         Education 

            0                 0                 0                 0                 0 

Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount  Loan_Amount_Term 

            0                 0                 0                 0                 0

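1.4 Building predictive models

For the predictive models, we build a logistic regression, a decision tree, and a random forest in turn, then compare and tune them. In terms of code, Python's sklearn library lets us wrap model fitting and evaluation in a single function, which is a clear advantage over R, where each model calls a different package. It is an approach worth learning from.

1.4.1 Python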

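# Convert all categorical variables to numeric codes
from sklearn.preprocessing import LabelEncoder

var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    # astype(str) guards against columns that mix strings and numbers after filling
    df[i] = le.fit_transform(df[i].astype(str))
df.dtypes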

Loan_ID               object

Gender                 int64

Married                int64

Dependents             int64

Education              int64

Self_Employed          int64

ApplicantIncome        int64

CoapplicantIncome    float64

LoanAmount           float64

Loan_Amount_Term     float64

Credit_History       float64

Property_Area          int64

Loan_Status            int64

LoanAmount_log       float64

TotalIncome          float64

TotalIncome_log      float64

dtype: object
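
To make the effect of LabelEncoder concrete, here is a small sketch on a hypothetical list of values for one column (the list `areas` is made up, though the category names appear in this post).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one categorical column.
areas = ["Urban", "Rural", "Semiurban", "Urban"]

le = LabelEncoder()
codes = le.fit_transform(areas)

print(list(le.classes_))   # categories in sorted order
print(list(codes))         # integer code of each input value

# inverse_transform maps codes back to the original labels.
print(list(le.inverse_transform(codes)))
```

The codes follow the sorted order of the labels, so they carry no real ranking. Tree-based models usually cope with that, whereas a linear model will read the codes as an ordered scale.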

# Load sklearn and import the pieces used below.
from sklearn.model_selection import KFold   # K-fold cross-validation (sklearn.cross_validation was removed in newer releases)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model
    model.fit(data[predictors], data[outcome])

    # Predict on the training set
    predictions = model.predict(data[predictors])

    # Print the training accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Run 5-fold cross-validation
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        train_predictors = data[predictors].iloc[train, :]
        train_target = data[outcome].iloc[train]
        model.fit(train_predictors, train_target)   # fit on the training folds
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Final fit on the full data set
    model.fit(data[predictors], data[outcome])
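
The hand-written cross-validation loop in classification_model can also be expressed with sklearn's `cross_val_score`. This is a sketch under stated assumptions, not the function used in this post: the data `X` and `y` are made up so that the single feature equals the label, and the decision tree is chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the single feature equals the label, so every fold is separable.
X = np.array([[0], [1]] * 10)
y = np.array([0, 1] * 10)

model = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold; an unshuffled KFold mirrors the loop in the function.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))

print(scores)
print("Cross-Validation Score : {0:.3%}".format(scores.mean()))
```

Passing a `KFold` object keeps the split identical in spirit to the manual loop. With an integer `cv`, sklearn would use stratified folds for a classifier instead.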

1.4.2 R

library(rpart)
library(e1071)
library(rpart.plot)
library(caret)
library(Metrics)

test <- read.csv("C:/Users/HP/Desktop/test.csv")

1.5 Logistic regression

1.5.1 Python

from sklearn.linear_model import LogisticRegression

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 80.945%

Cross-Validation Score : 80.946%
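
With Credit_History as the only predictor, the model can be expected to behave like a simple rule: approve exactly when the credit history equals 1. The arithmetic below is a sketch that rebuilds the accuracy of that rule from the frequency and probability tables in section 1.2.1. Those tables exclude the 50 rows whose Credit_History was missing and later filled with 1.0, which presumably explains why the result differs slightly from the 80.945% reported above.

```python
# Figures copied from the frequency and probability tables in section 1.2.1.
counts = {0.0: 89, 1.0: 475}                  # applicants per Credit_History class
approval = {0.0: 0.078652, 1.0: 0.795789}     # share approved in each class

# Recover the number of approved applicants per class.
approved = {k: round(counts[k] * approval[k]) for k in counts}

# Rule: predict "approved" exactly when Credit_History == 1.0.
correct = (counts[0.0] - approved[0.0]) + approved[1.0]
total = sum(counts.values())

print(approved)
print("Rule accuracy: {0:.3%}".format(correct / total))
```

The rule gets 460 of the 564 rows with a known credit history right, about 81.6%, which is in line with the accuracy the one-variable model reports.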

1.5.2 R

# Set the control parameters
fitControl <- trainControl(method = "cv", number = 5)
# Fit the model
logistic_model <- train(Loan_Status ~., data= train, method = "glmnet", trControl = fitControl, family="binomial")
print(logistic_model)

# Select the best model
logistic_model <- train(Loan_Status ~., data= train, method = "glmnet", trControl = fitControl, family="binomial")
# Predict on the test set; predict() on a caret model takes type = "raw" or "prob"
logistic_predict <- predict(logistic_model, newdata= test,type = "raw")

# Evaluate the model: cross-validated confusion matrix of the caret fit
confusionMatrix(logistic_model)

1.6 Decision tree

1.6.1 Python

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 80.945%

Cross-Validation Score : 80.946%

1.6.2 R

# Set the control parameters
fitControl <- trainControl(method = "cv", number = 5)
cartGrid <- expand.grid(.cp=(1:50)*0.01)
# Fit the model
tree_model <- train(Loan_Status ~Credit_History+Gender+Married+Education, data= train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
print(tree_model)

# Select the best decision tree model
main_tree <- rpart(Loan_Status ~., data= train, control = rpart.control(cp=0.36))

# Plot the decision tree
prp(main_tree)

# Predict with the model
pre_score <- predict(main_tree,type = "vector")
# Evaluate the model
xtabs(~ pre_score + train$Loan_Status)

1.7 Random forest

1.7.1 Python

# Random forest
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 87.296%

Cross-Validation Score : 76.549%
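
The gap between the training accuracy and the cross-validation score above is the usual sign of overfitting. The effect can be reproduced in isolation. This sketch swaps in a single unpruned decision tree for the random forest to keep it small, and uses made-up data whose labels are pure noise; all names and values are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical data: the labels are pure noise, unrelated to the features.
X = rng.rand(200, 3)
y = rng.randint(0, 2, size=200)

model = DecisionTreeClassifier(random_state=0)

# Accuracy on the data the tree was fitted to.
train_acc = model.fit(X, y).score(X, y)

# Accuracy on held-out folds.
cv_acc = cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean()

print("Accuracy : {0:.3%}".format(train_acc))
print("Cross-Validation Score : {0:.3%}".format(cv_acc))
```

The tree memorizes the training rows, so the first figure says nothing about unseen data. Only the cross-validation score is a fair basis for comparing models.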

# Feature importance scores
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

Credit_History      0.427036

Dependents          0.142901

Property_Area       0.102788

Loan_Amount_Term    0.100776

Self_Employed       0.077128

Gender              0.055985

Education           0.048056

Married             0.045331

dtype: float64
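
How the importance scores behave is easier to see on data where the answer is known. In this sketch the column names `signal`, `noise_a`, and `noise_b` and all values are hypothetical: one column determines the label and the other two are random.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical data: "signal" determines the label, the other two columns are noise.
X = pd.DataFrame({
    "signal": rng.randint(0, 2, size=300),
    "noise_a": rng.rand(300),
    "noise_b": rng.rand(300),
})
y = X["signal"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort them for readability.
featimp = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(featimp)
```

Because the scores are relative shares, a high value only means a feature was more useful than the others in this particular model, which is how Credit_History ends up far ahead in the table above.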

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['Credit_History','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 80.945%

Cross-Validation Score : 80.458%

1.7.2 R

library(randomForest)

# Set the control parameters
control <- trainControl(method = "cv", number = 5)
# Random forest
rf_model <- train(Loan_Status ~Credit_History+Dependents+Property_Area, data= train, method = "parRF",
                 trControl= control, prox = TRUE, allowParallel= TRUE)
print(rf_model)

# Select the best model
forest_model <- randomForest(Loan_Status ~Credit_History+Dependents+Property_Area, data= train,
                            mtry = 15, ntree = 1000)
print(forest_model)

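# Plot the variable importance ranking
varImpPlot(forest_model)

# Predict on the test set; predict() for randomForest takes type = "response", "prob", or "vote"
main_predict <- predict(forest_model, newdata= test,type = "response")

# Evaluate the model on the training data, using the out-of-bag predictions
confusionMatrix(predict(forest_model), train$Loan_Status)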

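Afterword (pitfalls ahead):

When R is run in Jupyter, some models throw puzzling problems and errors, so parts of the R code do not run successfully there. I am working on finding the causes and fixing them. If you spot mistakes, corrections are very welcome. Many thanks!

"乐享数据" is a personal public account (公众号). It represents no organization's interests and has no commercial purpose. Any reproduction or adaptation requires authorization from the original author via the account; all rights reserved. You are welcome to follow "乐享数据".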
