Eight Live Sessions, Eight Case Studies: Lecture 5


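Requirement:

A telecom company wants to predict each customer's likelihood of churning from the customer's information. The data is stored in "telecom_churn.csv".

Analysis approach:

Before modeling the factors that influence churn, first run a two-variable independence analysis between each explanatory variable and the response variable to get an initial view of which factors affect churn, and then build the churn prediction model.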
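As a minimal sketch of the two-variable independence check described above, the following computes Pearson's chi-square statistic for a 2x2 contingency table using only the standard library. The counts are made-up illustrative numbers, not values from telecom_churn.csv, and the helper name `chi2_2x2` is hypothetical. No continuity correction is applied here, so the result can differ from `scipy.stats.chi2_contingency`, which by default applies Yates' correction to 2x2 tables.

```python
import math


def chi2_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value, expected). No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, expected


# Hypothetical counts: rows = posTrend (0, 1), columns = churn (0, 1)
toy_table = [[30, 70], [60, 40]]
stat, p_value, expected = chi2_2x2(toy_table)
print('chisq = %.4f, p-value = %.6f' % (stat, p_value))
```

A small p-value suggests that the row variable and the column variable are not independent, which is the sense in which a variable is said to have predictive value for churn.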

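The main variables are described below:

#subscriberID="ID of the individual customer"
#churn="Whether the customer churned: 1 = churned"
#Age="Age"
#incomeCode="Code for the average income of the area where the user lives"
#duration="Time on network"
#peakMinAv="Highest single-month call duration in the observation period"
#peakMinDiff="Increase in call duration from the first month to the last month of the observation period"
#posTrend="Whether the user's call duration shows an upward trend: 1 = yes"
#negTrend="Whether the user's call duration shows a downward trend: 1 = yes"
#nrProm="Number of marketing contacts from the telecom company"
#prom="Whether the user was marketed to in the most recent month: 1 = yes"
#curPlan="Plan type at the start of the observation period: 1 = up to 200 minutes of calls; 2 = 300 minutes; 3 = 350 minutes; 4 = 500 minutes"
#avPlan="Average plan type over the observation period"
#planChange="Whether the plan was changed during the observation period: 1 = yes"
#posPlanChange="Whether the plan was upgraded during the observation period: 1 = yes"
#negPlanChange="Whether the plan was downgraded during the observation period: 1 = yes"
#call_10086="Number of calls to 10086"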

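The steps are as follows:

(1) Two-variable analysis: test whether an upward trend in the user's call duration (posTrend) has predictive value for churn (churn).

(2) Split the original data into training and test sets. Fit a logistic regression of churn on time on network (duration) using the training set, then build a confusion matrix on the test set (threshold 0.5), report accuracy and recall, and provide the ROC curve and AUC.

(3) Use forward stepwise selection to choose variables from the remaining candidates, build the best model according to AIC, plot the ROC curve, and check the model's variance inflation factors.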
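The evaluation metrics named in the second step can be sketched on toy data as follows. The labels and scores are made up for illustration, and the helper names `confusion_counts` and `auc_rank` are hypothetical. AUC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties count one half), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc_rank(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels (1 = churned) and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

# Classify as churn when the predicted probability exceeds the 0.5 threshold
y_pred = [int(p > 0.5) for p in proba]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all cases classified correctly
recall = tp / (tp + fn)              # share of actual churners that were caught
auc = auc_rank(y_true, proba)
print('accuracy = %.2f, recall = %.2f, AUC = %.4f' % (accuracy, recall, auc))
```

Accuracy and recall depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the assignment asks for both.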


Assignment:


# coding: utf-8

# In[59]:

import pandas as pd
import os
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from statsmodels.formula.api import ols


# In[5]:

os.chdir(r'E:\数据分析\天善-PYTHON\八场直播,八大案例(金融风控)\第5讲\提交-第五讲:Logistic回归构建初始信用评级和分类模型检验\作业')
telecorm = pd.read_csv(r'telecom_churn.csv',encoding = 'gbk')
telecorm.head()


# In[6]:

# Clean the data: drop rows with missing values
telecorm = telecorm.dropna()


# In[13]:

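# Two-variable analysis: test whether an upward trend in call duration (posTrend) has predictive value for churn (churn)
cross_table = pd.crosstab(telecorm.posTrend, telecorm.churn, margins=True)

def percConvert(ser):
    # Convert counts to row proportions by dividing by the row total (the last element, 'All')
    return ser / float(ser.iloc[-1])

print(cross_table.apply(percConvert, axis=1))
# Chi-square test of independence on the 2x2 table without the margins
chisq, p_value, dof, expected_freq = stats.chi2_contingency(cross_table.iloc[:2, :2])
print('''chisq = %6.4f
p-value = %6.4f
dof = %i
expected_freq = %s''' % (chisq, p_value, dof, expected_freq))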


# In[33]:

# Split the original data into training (70%) and test (30%) sets
train = telecorm.sample(frac=0.7, random_state=1234).copy()
test = telecorm[~telecorm.index.isin(train.index)].copy()
print('Training set size: %i \n Test set size: %i' % (len(train), len(test)))


# In[34]:

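# Fit a logistic regression of churn on time on network (duration) using the training set
# The logit link is the default for the Binomial family, so it does not need to be passed explicitly
lg = smf.glm('churn~duration', data=train, family=sm.families.Binomial()).fit()
lg.summary()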


# In[35]:

train['proba'] = lg.predict(train)
test['proba'] = lg.predict(test)
test['proba'].head(10)


# In[49]:

# Build the confusion matrix on the test set (threshold 0.5), report accuracy and recall, then the ROC curve and AUC
test['prediction'] = (test['proba'] > 0.5).astype('int')
print(pd.crosstab(test.churn, test.prediction, margins=True))
# Compute accuracy and recall
acc = sum(test['prediction'] == test['churn']) / float(len(test))
print('The accuracy is %.2f' % acc)
recall = sum((test['prediction'] == 1) & (test['churn'] == 1)) / sum(test['churn'] == 1)
print('The recall is %.2f' % recall)


# In[53]:

fpr_test, tpr_test, th_test = metrics.roc_curve(test.churn, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train.churn, train.proba)

plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, '')
plt.title('ROC curve')
plt.show()
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))


# In[54]:

'''Forward selection by AIC for a logistic regression'''
def forward_select(data, response):
    # Candidates are all columns except the response
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response, ' + '.join(selected + [candidate]))
            # The response is binary, so each candidate is scored with a logistic GLM rather than OLS
            aic = smf.glm(formula=formula, data=data,
                          family=sm.families.Binomial()).fit().aic
            aic_with_candidates.append((aic, candidate))
        # Sort in descending order so that pop() returns the candidate with the lowest AIC
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()
        if current_score > best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:
            print('forward selection over!')
            break

    formula = "{} ~ {} ".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = smf.glm(formula=formula, data=data,
                    family=sm.families.Binomial()).fit()
    return model


# In[60]:

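# Select variables on the training set only, restricted to the candidate columns plus the response,
# so that subscriberID is excluded and the test set stays unseen
data_for_select = train[['AGE','churn','incomeCode','duration','peakMinAv','peakMinDiff','posTrend','negTrend','nrProm','prom','curPlan','avgplan', 'planChange','posPlanChange','negPlanChange','call_10086']]
lm_m = forward_select(data=data_for_select, response='churn')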


# In[62]:

train['proba'] = lm_m.predict(train)
test['proba'] = lm_m.predict(test)
fpr_test, tpr_test, th_test = metrics.roc_curve(test.churn, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train.churn, train.proba)

plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, '')
plt.title('ROC curve')
plt.show()
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))


# In[63]:

# Variance inflation factor (VIF)
def vif(df, col_i):
    # Regress col_i on all other columns; VIF = 1 / (1 - R^2)
    cols = list(df.columns)
    cols.remove(col_i)
    formula = col_i + '~' + '+'.join(cols)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)

var = ['AGE','churn','incomeCode','duration','peakMinAv','peakMinDiff','posTrend','negTrend','nrProm','prom','curPlan','avgplan', 'planChange','posPlanChange','negPlanChange','call_10086']
# VIF is computed among the explanatory variables only, so the response is dropped
data = train[var].drop(['churn'], axis=1)

for i in data.columns:
    print(i, '\t', vif(df=data, col_i=i))
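For intuition about what the variance inflation factor measures, here is a minimal standard-library sketch for the special case of exactly two predictors. In that case the R² from regressing one predictor on the other equals their squared Pearson correlation, so VIF = 1 / (1 - r²). The data and the helper names `pearson_r` and `vif_two_predictors` are hypothetical and are not part of the assignment code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)


# Hypothetical predictor values
x1 = [1, 2, 3, 4, 5]
x_unrelated = [2, 1, 3, 1, 2]             # uncorrelated with x1
x_collinear = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2 * x1

vif_low = vif_two_predictors(x1, x_unrelated)
vif_high = vif_two_predictors(x1, x_collinear)
print('VIF (unrelated) = %.2f, VIF (nearly collinear) = %.2f' % (vif_low, vif_high))
```

A VIF close to 1 indicates that a predictor is nearly uncorrelated with the others, while a commonly cited rule of thumb treats values above 10 as a sign of strong multicollinearity.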


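This article was written by em16 and is licensed under the Creative Commons Attribution-ShareAlike 3.0 China Mainland license.
Contact the author before reposting or quoting, and credit the author and the original source.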