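Requirements:
Study the factors that influence second-hand housing prices and build a price prediction model. The data are stored in "sndHsPr.csv".
Analysis approach:
Before modeling the factors that influence price, first run a descriptive analysis of each variable to form an initial view of which factors matter, then build the price prediction model.
Variables:
dist - district
roomnum - number of bedrooms
halls - number of living rooms
AREA - floor area
floor - floor level
subway - whether the home is near a subway
school - whether the home is in a school district
price - price per square meter
Steps:
(1) Dependent variable analysis: price per unit area
(2) Independent variable analysis:
2.1 Distribution of each independent variable
2.2 Effect of each independent variable on the dependent variable
(3) Build the price prediction model
3.1 Linear regression model
3.2 Linear model with a log-transformed dependent variable
3.3 Log-linear model with interaction terms
(4) Prediction: a family of three wants their child to attend school in Dongcheng District, so the parents want to buy a two-bedroom apartment near the subway, 70 square meters, on a middle floor. Roughly what would the price be?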
# coding: utf-8
# In[73]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm  # provides sm.qqplot, used in the AREA distribution cell
from statsmodels.formula.api import ols
# In[2]:
data = pd.read_csv(r'E:\数据分析\天善-PYTHON\八场直播,八大案例(金融风控)\第4讲\提交-第四讲:统计建模与分析报告-二手房价格分析报告\作业\sndHsPr.csv',encoding = 'gbk')
data.head()
# In[4]:
# Analysis of price per unit area
data.price.describe()
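# In[ ]:
# Illustrative sketch (not part of the original analysis): prices are often
# right-skewed, which is one common motivation for modeling ln(price) as in
# step 3.2. The data below are synthetic lognormal draws, NOT sndHsPr.csv,
# and the helper name `skewness` is introduced only for this example.
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
demo_price = rng.lognormal(mean=10.5, sigma=0.5, size=5000)  # hypothetical prices

skew_raw = skewness(demo_price)           # clearly positive for lognormal data
skew_log = skewness(np.log(demo_price))   # close to zero after the log transform
print('skewness of price:    ', round(skew_raw, 3))
print('skewness of ln(price):', round(skew_log, 3))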
# In[14]:
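# Distribution of each independent variable
data1 = ['dist','roomnum','halls','floor','subway','school']
data2 = ['AREA']
# Discrete variables: frequency table plus bar chart
for i in data1:
    print('name:',i)
    print(data[i].value_counts())
    data[i].value_counts().plot(kind = 'bar')
    plt.show()
# Continuous variable: summary statistics, histogram with normal fit, Q-Q plot
for i in data2:
    print('name:',i)
    print(data[i].describe())
    # Note: sns.distplot is deprecated since seaborn 0.11
    sns.distplot(data[i],kde=True,fit=stats.norm)
    plt.show()
    fig = sm.qqplot(data[i],fit=True,line='45')
    plt.show()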
# In[65]:
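# Distribution of each independent variable
# Check which dtypes the variables have
# data.dtypes
for i in data.columns.values:
    if data[i].dtype == 'float64':
        # Float columns: box plot
        plt.boxplot(x = data[i])
        plt.title(i)
        plt.show()
    else:
        # All other columns: bar chart of value counts
        data[i].value_counts().plot(kind = 'bar')
        plt.title(i)
        plt.show()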
# In[70]:
# Effect of each independent variable on the dependent variable
# 1. X and Y are both continuous variables
var1 = ['roomnum','halls','AREA']
for i in var1:
    print(data[[i,'price']].corr(method = 'pearson'))
# 2. X is a categorical variable
var2 = ['subway','school']
for i in var2:
    print(data[[i,'price']].corr(method = 'spearman'))
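# In[ ]:
# Illustrative sketch: for a 0/1 variable such as subway or school, comparing
# group means of price is a simple complement to the rank correlation above.
# The small DataFrame is made up for this example, NOT taken from sndHsPr.csv.
import pandas as pd

demo = pd.DataFrame({
    'subway': [0, 0, 0, 1, 1, 1],
    'price':  [40000, 42000, 44000, 60000, 62000, 64000],
})
# Mean price within each subway group
group_mean = demo.groupby('subway')['price'].mean()
print(group_mean)
# Difference in mean price between homes near a subway and the rest
mean_gap = group_mean.loc[1] - group_mean.loc[0]
print('mean gap:', mean_gap)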
# In[75]:
# Build the price prediction model
lm_s = ols('price~roomnum+halls+AREA+subway+school',data = data).fit()
lm_s.summary()
# In[76]:
'''forward select'''
def forward_select(data, response):
    # Candidate predictors: every column except the response
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates=[]
        # Fit one model per remaining candidate and record its AIC
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response,' + '.join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_candidates.append((aic, candidate))
        # Sort descending so that pop() returns the lowest AIC
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate=aic_with_candidates.pop()
        if current_score > best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print ('aic is {},continuing!'.format(current_score))
        else:
            print ('forward selection over!')
            break
    formula = "{} ~ {} ".format(response,' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = ols(formula=formula, data=data).fit()
    return(model)
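# In[ ]:
# Illustrative sketch of the quantity forward_select compares: the AIC of a
# least-squares fit, written here with NumPy only. It uses the Gaussian
# log-likelihood and counts only the regression coefficients as parameters;
# that this equals the value statsmodels reports is an assumption, not
# something checked here. `ols_aic` is a hypothetical helper and the data are
# synthetic.
import numpy as np

def ols_aic(X, y):
    """AIC = 2k - 2*lnL for a least-squares fit; X must contain the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(1)
n_obs = 200
x_useful = rng.normal(size=n_obs)   # truly related to y
x_noise = rng.normal(size=n_obs)    # unrelated to y
y_demo = 3.0 + 2.0 * x_useful + rng.normal(scale=0.5, size=n_obs)

ones = np.ones(n_obs)
aic_null = ols_aic(np.column_stack([ones]), y_demo)
aic_useful = ols_aic(np.column_stack([ones, x_useful]), y_demo)
aic_noise = ols_aic(np.column_stack([ones, x_noise]), y_demo)
# A forward step would pick the candidate with the lowest AIC
print(aic_null, aic_useful, aic_noise)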
# In[81]:
data_for_select = data[['price','roomnum','halls','AREA','subway','school']]
lm_m = forward_select(data = data_for_select,response = 'price')
# In[86]:
# Linear model with a log-transformed dependent variable
data['ln_price'] =np.log(data['price'])
lm_s2 = ols('ln_price~roomnum+halls+AREA+subway+school',data = data).fit()
lm_s2.summary()
data_for_select2 = data[['ln_price','roomnum','halls','AREA','subway','school']]
lm_m2 = forward_select(data_for_select2,response = 'ln_price')
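# In[ ]:
# Step 3.3 in the outline (log-linear model with interaction terms) has no
# code in this notebook. A minimal sketch of the idea on synthetic data, using
# NumPy least squares: the interaction column is the product of two
# predictors. Which variables to interact (here school x AREA) is an
# assumption made only for illustration, and every *_demo name is hypothetical.
# With the formula interface, the same structure could be written as
# 'ln_price ~ AREA*school', which expands to AREA + school + AREA:school.
import numpy as np

rng = np.random.default_rng(2)
n_obs = 500
area_demo = rng.uniform(40, 150, size=n_obs)
school_demo = rng.integers(0, 2, size=n_obs)
# Synthetic truth: school shifts the level and also changes the AREA slope
ln_price_demo = (10.8 - 0.002 * area_demo + 0.30 * school_demo
                 - 0.001 * school_demo * area_demo
                 + rng.normal(scale=0.02, size=n_obs))

X_inter = np.column_stack([
    np.ones(n_obs),            # intercept
    area_demo,                 # main effect of AREA
    school_demo,               # main effect of school
    school_demo * area_demo,   # interaction term
])
coef, *_ = np.linalg.lstsq(X_inter, ln_price_demo, rcond=None)
print(dict(zip(['Intercept', 'AREA', 'school', 'school:AREA'], coef.round(4))))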
# In[92]:
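# Prediction: a family of three wants their child to attend school in Dongcheng District; the parents want a two-bedroom apartment near the subway, 70 square meters, on a middle floor. Roughly what would the price be?
predict_data=[{"dist":"东城区","roomnum":2,"halls":0,"floor":"middle",
               "subway":1,"school":1,"AREA":70,"price":0,"ln_price":0}]
predict_data_final = pd.DataFrame(predict_data)
# lm_s2 models ln_price (and has no dist or floor terms), so exponentiate
# the prediction to get a price per square meter
np.exp(lm_s2.predict(predict_data_final))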
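# In[ ]:
# Illustrative sketch: a model fitted on ln_price returns values on the log
# scale, so exp() is needed to get back to a price per square meter, and
# multiplying by the area then gives a total price. The log-scale value below
# is a hypothetical placeholder, NOT an output of lm_s2.
import math

pred_ln_price = 11.0     # hypothetical predicted ln(price per square meter)
area_m2 = 70             # area from the question above

unit_price = math.exp(pred_ln_price)    # back to price per square meter
total_price = unit_price * area_m2      # price of the whole apartment
print('unit price :', round(unit_price))
print('total price:', round(total_price))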