Python Web Crawler: URLs

Published: 2017-08-16

Python

Note:

The following content comes from these bloggers: http://blog.csdn.net/pleasecallmewhy

http://cuiqingcai.com/

I have reorganized it into my own notes as needed, for study purposes.

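Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally.

In Python 2, pages are fetched with urllib2 (in Python 3 the same functionality lives in urllib.request). It offers a very simple interface in the form of the urlopen function.

Function: urlopen(url, data, timeout)

url: the address to request. data: the data to send when accessing the URL (optional). timeout: the timeout in seconds (optional).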

    import urllib2

    # Call urlopen from the urllib2 library, passing in a URL
    response = urllib2.urlopen('http://www.hao123.com')
    # The response object has a read() method that returns the fetched page content
    html = response.read()
    print(html)
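For readers on Python 3, where urllib2 was merged into urllib.request, the same fetch might look like the sketch below. The helper name `fetch` and the 10-second default timeout are assumptions made for illustration and are not part of the original notes.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes of the resource at `url`, or None on failure.

    `fetch` and its default timeout are illustrative choices.
    """
    try:
        # urlopen returns a file-like response; the with-block closes it
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # URLError is the base class of HTTPError, so HTTP errors land here too
        print("request failed:", err.reason)
        return None
```

Calling `fetch('http://www.hao123.com')` would return the page as bytes; decoding it to text requires knowing the page's character encoding.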

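urllib2 represents the HTTP request you make with a Request object.

In its simplest form, you create a Request object from the address you want to request.

Calling urlopen with that Request object returns a response object for the request.

The response object behaves like a file object, so you can call .read() on it.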

    import urllib2

    # Build the request object from the target URL
    req = urllib2.Request('http://www.hao123.com')
    # The returned information is stored in the response object
    response = urllib2.urlopen(req)
    the_page = response.read()
    print(the_page)
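Because a Request object only describes the request, it can be inspected before anything is sent. The sketch below uses the Python 3 counterpart, urllib.request.Request (an assumption about the reader's interpreter); the `b'key=value'` payload is a made-up placeholder.

```python
import urllib.request

# Build the request without sending it; nothing touches the network here
req = urllib.request.Request('http://www.hao123.com')

print(req.full_url)      # the URL the request targets
print(req.get_method())  # 'GET', because no data is attached
print(req.host)          # host name parsed from the URL

# Attaching a bytes payload makes the same kind of request a POST
post_req = urllib.request.Request('http://www.hao123.com', data=b'key=value')
print(post_req.get_method())
```

The request is only sent once it is passed to urlopen.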

Sending data (two methods: POST and GET)

The POST example comes first, followed by the GET example.

POST method:
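    import urllib
    import urllib2

    # The form data to send
    values = {'username': 'wujiadong', 'password': '1234567'}
    # The page to request
    url = 'http://www.baidu.com'
    # Encode the dict as a URL-encoded form string
    data = urllib.urlencode(values)
    # Passing data makes this a POST request that carries the form
    request = urllib2.Request(url, data)
    # Receive the response
    response = urllib2.urlopen(request)
    # Read the content of the response
    html = response.read()
    print(html)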
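In Python 3 the encoding step changes slightly, since urlencode moved to urllib.parse and the POST body must be bytes. A minimal sketch, assuming Python 3, that builds the request without sending it:

```python
import urllib.parse
import urllib.request

values = {'username': 'wujiadong', 'password': '1234567'}

# urlencode returns a str; Python 3 requires the POST body to be bytes
data = urllib.parse.urlencode(values).encode('ascii')

# Supplying data makes urllib issue a POST instead of a GET
request = urllib.request.Request('http://www.baidu.com', data=data)

print(data)
print(request.get_method())
# To actually send it: urllib.request.urlopen(request).read()
```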

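GET method:

    import urllib
    import urllib2

    values = {}
    values['username'] = 'wujiadong'
    values['password'] = '1234567'
    # Encode the parameters as a query string
    data = urllib.urlencode(values)
    print(data)
    url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
    # The URL already contains '?', so further parameters are joined with '&'
    full_url = url + '&' + data
    # No data argument, so this is sent as a GET request
    request = urllib2.Request(full_url)
    response = urllib2.urlopen(request)
    html = response.read()
    print(html)
    print(full_url)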
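Choosing between '?' and '&' by hand is easy to get wrong, so a GET URL can also be assembled with the standard URL-parsing functions. The sketch below assumes Python 3; `add_query` is a hypothetical helper name, and example.com is a placeholder host.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query(url, params):
    """Return `url` with `params` appended to any query string it already has."""
    parts = urlsplit(url)
    # Keep the existing query parameters, then append the new ones
    query = parse_qsl(parts.query, keep_blank_values=True) + list(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_query('http://example.com/search?q=a', {'page': '2'}))
print(add_query('http://example.com/', {'q': 'x y'}))
```

One consequence of this design is that existing parameter values are re-encoded as well, so characters such as ':' and '/' inside a value come out percent-escaped.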

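Setting headers on the HTTP request (using headers to make the request look like it comes from a browser)

Pay special attention to certain headers, such as User-Agent, because servers check them.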

    import urllib
    import urllib2

    url = 'http://www.baidu.com'
    values = {}
    values['username'] = 'wujiadong'
    values['password'] = '1234567'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    # The header name is 'User-Agent' and its value is the variable, not a string literal
    headers = {'User-Agent': user_agent}
    data = urllib.urlencode(values)
    # Pass the form data and the headers along with the URL
    request = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(request)
    page = response.read()
    print(page)
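To confirm that a header really ends up on the request, it can be read back before sending. The sketch below assumes Python 3's urllib.request; the Referer value is a made-up illustration of another header that some servers inspect.

```python
import urllib.request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {
    'User-Agent': user_agent,
    # Hypothetical Referer value; some servers check where a request came from
    'Referer': 'http://www.baidu.com',
}
request = urllib.request.Request('http://www.baidu.com', headers=headers)

# Request stores header names capitalized, so the key becomes 'User-agent'
print(request.get_header('User-agent'))
print(request.header_items())
```

A misspelled header name raises no error and is simply sent as an unknown header, which is why reading the value back is a useful check.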
