Second assignment submission for teacher Ben's class

Posted: 2018-06-11 Views: 1467

八大 live-session assignment

"""
Created on Sat Jul 7 16:12:18 2018

@author: 知行合一

"""

'''

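Study the factors that affect second-hand housing prices and build a price
prediction model. The data is in the attached file "sndHsPr.csv".

Approach:
Before modelling the factors that affect the price, run a descriptive analysis
of each variable to get a first impression of what drives the price, then
build the prediction model. The variables are:

dist: district
roomnum: number of bedrooms
halls: number of halls (living rooms)
AREA: floor area
floor: floor level
subway: whether the flat is near a subway station
school: whether the flat is in a school district
price: unit price per square metre

Steps:
(1) Dependent variable: analysis of the price per unit area
(2) Independent variables:
    2.1 distribution of each independent variable
    2.2 effect of each independent variable on the dependent variable
        (descriptive statistics and statistical tests)
(3) Build the price prediction model:
    3.1 linear regression model
    3.2 linear model with a log-transformed dependent variable
    3.3 log-linear model with interaction terms
(4) Prediction: a family of three wants their child to attend school in
    Dongcheng district, so the parents are looking for a two-bedroom flat
    near the subway, 70 square metres, on a middle floor. Roughly what will
    it cost?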

'''

import os

import pandas as pd

import seaborn as sns

from scipy import stats

import sys

import numpy as np

sys.path.append(r"C:\Users\知行合一\Documents\toolsUnitl")

#get_ipython().magic('matplotlib inline')

os.chdir(r'C:\Users\知行合一\Documents\ben\第四讲作业')

sndHsPr=pd.read_csv('sndHsPr.csv',encoding='gbk')

sndHsPr=sndHsPr.sample(5000)

# In[1]:

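# Dependent-variable analysis.
# Note: area_Price = unit price (price) * floor area (AREA), i.e. the total
# price of each listing; it is the target used in the rest of this script.
sndHsPr['area_Price']=sndHsPr.price*sndHsPr.AREA

# A histogram shows the distribution; kind='bar' would draw one bar per row,
# which is unreadable for 5000 sampled rows.
sndHsPr.area_Price.plot(kind='hist', bins=50)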

# In[1]:

## Check whether the price is approximately normally distributed

sns.distplot(sndHsPr.area_Price, kde=True, fit=stats.norm)

# In[1]:

import statsmodels.api as sm

from matplotlib import pyplot as plt

fig=sm.qqplot(sndHsPr.area_Price, fit=True, line='45')

fig.show()

# Box plot

# In[1]:

sndHsPr.area_Price.plot(kind='box')

# In[1]:

sndHsPr.area_Price.describe()

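# Confidence interval for the mean price.
# se is the standard error of the mean: s / n ** 0.5
se=sndHsPr.area_Price.std()/len(sndHsPr)**0.5

# 1.96 is the normal critical value for a 95% interval (the sample is large)
LB=sndHsPr.area_Price.mean()-1.96*se

UB=sndHsPr.area_Price.mean()+1.96*se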

(LB,UB)

# In[2]:

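# For a confidence interval at an arbitrary confidence level, write a small helper
def confint(x, alpha=0.05):
    n = len(x)
    xb = x.mean()
    df = n-1
    # half-width: standard error times the t critical value
    tmp = (x.std() / n ** 0.5) * stats.t.ppf(1-alpha/2, df)
    return {'Mean': xb, 'Degree of Freedom':df, 'LB':xb-tmp, 'UB':xb+tmp}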

confint(sndHsPr.price, 0.05)
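# A minimal sketch of where the critical value in a 95% interval comes from.
# It compares the normal and the t critical values for a sample of 5000 rows.
# The sample is synthetic (an assumption for illustration, not the housing data),
# and the names sample, z_crit, t_crit are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic sample, used only to have something to compute on
rng = np.random.default_rng(0)
sample = rng.normal(loc=100.0, scale=15.0, size=5000)

n = sample.size
se = sample.std(ddof=1) / n ** 0.5     # standard error of the mean

z_crit = stats.norm.ppf(0.975)         # normal critical value for 95%
t_crit = stats.t.ppf(0.975, n - 1)     # t critical value with n-1 degrees of freedom

z_interval = (sample.mean() - z_crit * se, sample.mean() + z_crit * se)
t_interval = (sample.mean() - t_crit * se, sample.mean() + t_crit * se)

print(round(z_crit, 4), round(t_crit, 4))
print(z_interval, t_interval)
```

# With a sample this large the two critical values are nearly identical,
# so the t-based helper and the normal-based interval give almost the same bounds.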

#In[4]:

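# (2) Independent-variable analysis: var_discrete holds the categorical
# columns, var_continue the continuous ones.
# The target (total price) is continuous, so it is compared with the columns
# in var_continue by correlation analysis, and with the columns in
# var_discrete by ANOVA or a two-sample t-test.
var_discrete=[]
var_continue=[]
for col in sndHsPr.columns:
    if sndHsPr[col].dtype.name=='object':
        var_discrete.append(col)
    else:
        var_continue.append(col)

## Check that the split is correct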

sndHsPr.dtypes

# In[2]:

# Continuous target against a categorical variable (dist has several levels, so use ANOVA against area_Price)

dist=sndHsPr.dist.unique().tolist()

dist_dict={'xicheng':'西城区','fengtai':'丰台区','haidian':'海淀区',

'dongcheng':'东城区','chaoyang':'朝阳区','shijingshan':'石景山区'}

sndHsPr['dist']=sndHsPr['dist'].map(dist_dict)

# In[2]:

from pylab import mpl

mpl.rcParams['font.sans-serif'] = ['SimHei'] # set the default font so the Chinese labels render

mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures

sndHsPr['dist'].value_counts().plot(kind='bar')

## Descriptive statistics: mean price by district

sndHsPr.area_Price.groupby(sndHsPr.dist).mean().plot(kind='bar')

# In[34]:

## Descriptive statistics: mean price by district, sorted

sndHsPr.area_Price.groupby(sndHsPr.dist).mean().sort_values(ascending=True).plot(kind='barh')

# In[37]:

## Descriptive statistics: box-and-whisker plot (x is the categorical variable, y the continuous one)

sns.boxplot(x = 'dist', y = 'area_Price', data = sndHsPr)

# ### 1.3 Summary table

# In[38]:

sndHsPr.pivot_table(values='area_Price',index='dist',aggfunc=np.mean)

# In[39]:

sndHsPr.pivot_table(values='area_Price',index='dist',aggfunc=np.mean).plot(kind = 'bar')

# In[39]:

# ANOVA via a regression model (a p-value below 0.05 suggests the variable is related to the target)

import statsmodels.api as sm

from statsmodels.formula.api import ols

sm.stats.anova_lm(ols('area_Price ~ C(dist)',data=sndHsPr).fit()).iloc[0,-1]

# In[15]:

floor_dict={'middle':'中层','high':'高层','low':'低层'}

sndHsPr.floor=sndHsPr.floor.map(floor_dict)

# In[15]:

## Same analysis for floor as above

sndHsPr.floor.value_counts().plot(kind='bar')

# In[15]:

sndHsPr.area_Price.groupby(sndHsPr.floor).mean().plot(kind='bar')

# In[15]:

sndHsPr.area_Price.groupby(sndHsPr.floor).mean().sort_values(ascending=True).plot(kind='barh')

# In[15]:

sns.boxplot(x='floor',y='area_Price',data=sndHsPr)

# In[15]:

sndHsPr.pivot_table(index='floor',values=['area_Price'],aggfunc=np.mean)

# In[15]:

sndHsPr.pivot_table(index='floor',values=['area_Price'],aggfunc=np.mean).plot(kind='bar')

# In[15]: ANOVA

sm.stats.anova_lm(ols('area_Price ~ C(floor)',data=sndHsPr).fit())

# In[16]: compare with the result for district

sm.stats.anova_lm(ols('area_Price ~ C(dist)',data=sndHsPr).fit())

# In[16]: correlation analysis; use count to check for missing values

sndHsPr[var_continue].describe().T

# In[16]:

## Relationship between roomnum and area_Price

## 1. Descriptive view with a scatter plot

sndHsPr.plot(x='roomnum',y='area_Price',kind='scatter')

sndHsPr.roomnum.unique()

# In[16]:

# Based on the plot above, roomnum is also treated as a categorical variable

sndHsPr.area_Price.groupby(sndHsPr['roomnum']).mean().plot(kind='bar')

# In[16]:

sndHsPr.area_Price.groupby(sndHsPr.roomnum).mean().sort_values(ascending=True).plot(kind='barh')

# In[16]:

sns.boxplot(x='roomnum',y='area_Price',data=sndHsPr)

sndHsPr.pivot_table(index='roomnum',values=['area_Price'],aggfunc=[np.mean])

# In[16]:

sndHsPr.pivot_table(index='roomnum',values=['area_Price'],aggfunc=[np.mean]).plot(kind='bar')

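# In[15]: ANOVA
# roomnum is treated as categorical here, so wrap it in C();
# without C() the formula fits a single linear slope instead.
sm.stats.anova_lm(ols('area_Price ~ C(roomnum)',data=sndHsPr).fit())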

# In[20]: correlation analysis, descriptive statistics

var_continue.remove('halls')

var_discrete.append('halls')

var_continue.remove('school')

var_discrete.append('school')

var_continue.remove('subway')

var_discrete.append('subway')

#sndHsPr.plot(x='halls',y='area_Price',kind='scatter')

#sndHsPr.plot(x='subway',y='area_Price',kind='scatter')

#sndHsPr.plot(x='school',y='area_Price',kind='scatter')

# In[20]:

var_continue.remove('roomnum')

var_discrete.append('roomnum')

# In[20]:

# Correlation analysis for the continuous variables AREA and price

sndHsPr.plot(x='price',y='area_Price',kind='scatter')

# In[20]:

sndHsPr['price_ln']=np.log(sndHsPr['price'])

sndHsPr.plot(x='price_ln',y='area_Price',kind='scatter')

# In[20]:

sndHsPr.plot(x='AREA',y='area_Price',kind='scatter')

# In[21]:

sndHsPr[['AREA','area_Price']].corr(method='pearson')

# In[22]:

sndHsPr[['price_ln','area_Price']].corr(method='pearson')

# In[23]:

sndHsPr[['price','area_Price']].corr(method='pearson')

# In[23]:

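dist_num={'西城区':0,'丰台区':1,'海淀区':2,
          '东城区':3,'朝阳区':4,'石景山区':5}

# floor was mapped to Chinese labels earlier, so every key must be a Chinese
# label; a 'low' key would leave all low-floor rows as NaN.
floor_num={'中层':0,'高层':1,'低层':2}

sndHsPr['dist_num']=sndHsPr['dist'].map(dist_num)
sndHsPr['floor_num']=sndHsPr['floor'].map(floor_num)

# Use the floor_num column in the formula; floor_dict is a Python dict, not a column.
lineMode = sm.formula.ols('area_Price ~ AREA+price+C(dist_num)+C(floor_num)+C(halls)+C(school)+C(subway)+C(roomnum)', sndHsPr).fit()

# In[24]:

# Build the X values for the prediction.
# (4) Prediction: a family of three wants their child to attend school in
# Dongcheng district, so the parents are looking for a two-bedroom flat near
# the subway, 70 square metres, on a middle floor. Roughly what will it cost?
feature={'AREA':70,
         'dist_num':3,    # Dongcheng district
         'price':sndHsPr[sndHsPr['dist_num']==3]['price'].mean(),
         'floor_num':0,   # middle floor
         'halls':int(sndHsPr[sndHsPr['dist_num']==3]['halls'].mean()),
         'school':1,
         'subway':1,
         'roomnum':2
         }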

predict_X=pd.DataFrame([feature])

# In[24]:

result=lineMode.predict(predict_X)[0]
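# The plan at the top also lists a log-response model (3.2) and a log-linear
# model with interaction terms (3.3), which this script does not fit. A minimal
# sketch of both follows. It runs on synthetic data, and the column values,
# coefficients and names (demo, log_model, inter_model) are assumptions made
# for illustration, not results from sndHsPr.csv.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic listings: district, school-district flag, floor area
rng = np.random.default_rng(2)
n = 400
demo = pd.DataFrame({
    'AREA': rng.uniform(40, 150, n),
    'dist': rng.choice(['a', 'b', 'c'], n),
    'school': rng.integers(0, 2, n),
})

# Synthetic unit price with multiplicative noise, which suits a log model
base = demo['dist'].map({'a': 40000, 'b': 55000, 'c': 70000})
demo['price'] = base * (1 + 0.2 * demo['school']) * np.exp(rng.normal(0, 0.1, n))
demo['log_price'] = np.log(demo['price'])

# 3.2: linear model on the log of the response
log_model = ols('log_price ~ AREA + C(dist) + C(school)', data=demo).fit()

# 3.3: same model plus the district x school interaction
inter_model = ols('log_price ~ AREA + C(dist) * C(school)', data=demo).fit()

# Predict on the log scale, then exponentiate to get back to a price
# (no retransformation bias correction is applied in this sketch)
new = pd.DataFrame([{'AREA': 70, 'dist': 'c', 'school': 1}])
pred_price = float(np.exp(inter_model.predict(new).iloc[0]))
print(pred_price)
```

# On the real data the same pattern would use a log of the chosen target column
# and the script's own predictors in place of the synthetic ones.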
