Python crawler: Han Han's Sina blog posts

Published: 2017-08-16


Blog address: http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Crawling the first page of posts

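    # -*- coding: utf-8 -*-
    # Python 2 script: urllib.urlopen() is not available under that name in Python 3

    import re      # regular expression module
    import urllib  # urllib library

    # address of the first page of the post list
    url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

    # access the URL with urlopen() from urllib;
    # the step of building a request object is skipped here
    response = urllib.urlopen(url)

    # read the response into the variable html; this completes fetching the page
    html = response.read()
    # print(html)  # uncomment to print the fetched HTML to the terminal

    # regular expression used for matching; the loop below reads two groups
    # (post address, post title), so the pattern needs two capture groups
    pattern = re.compile('(.*?)', re.S)

    # use findall() to pull the wanted content out of the fetched HTML
    blog_address = re.findall(pattern, html)

    for i in blog_address:
        print(i[0])  # first group: the address of the post
        print(i[1])  # second group: the title of the post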
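The loop above reads two capture groups from each match, so the pattern needs one group for the link and one for the title. Below is a minimal sketch in Python 3. It assumes, without confirmation from the post, that each entry on the list page is an anchor of the form `<a title="" target="_blank" href="...">title</a>`; the `extract_posts` helper and the sample HTML are hypothetical.

```python
import re

# Assumed markup of one post link on the list page; adjust to the real HTML.
POST_LINK = re.compile(
    r'<a title="" target="_blank" href="(.*?)">(.*?)</a>', re.S
)


def extract_posts(html):
    """Return a list of (address, title) tuples found in html."""
    return POST_LINK.findall(html)


# Hypothetical sample standing in for the fetched page.
sample_html = (
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_example1.html">First post</a>\n'
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_example2.html">Second post</a>'
)

for address, title in extract_posts(sample_html):
    print(address)
    print(title)
```

With two groups in the pattern, `findall()` returns one tuple per match, which is what makes `i[0]` and `i[1]` meaningful in the loop.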

Part of the output is shown below:

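Problems encountered:

1. The crawl returns two extra results: the first and the last entries are not the content I wanted. Why?

2. Printing the results with print(i[0],i[1]) produces garbled output. Why is that?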
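Both questions have likely explanations, sketched here in Python 3. For the first, a loose pattern may also match links that are not posts, so one option is to keep only addresses that look like post pages; the `/s/blog_` check is an assumption about Sina's URL scheme. For the second, in Python 2 `print(i[0],i[1])` (without `from __future__ import print_function`) prints a tuple, and a tuple displays the repr of each item, so non-ASCII bytes show up as `\x..` escapes. Printing the items one at a time, or decoding them first, gives readable text. The `keep_posts` helper and the sample data are hypothetical.

```python
def keep_posts(matches):
    """Keep only (address, title) pairs whose address looks like a post."""
    # Assumption: post addresses contain '/s/blog_'.
    return [(addr, title) for addr, title in matches if '/s/blog_' in addr]


# Hypothetical findall() result with an unwanted first and last entry.
matches = [
    ('http://blog.sina.com.cn/u/example', 'extra first match'),
    ('http://blog.sina.com.cn/s/blog_example1.html', 'a post'),
    ('javascript:;', 'extra last match'),
]
posts = keep_posts(matches)
print(posts)

# A two-character Chinese title, encoded as the raw bytes read() returns.
raw_title = '\u535a\u6587'.encode('utf-8')

# A tuple shows the repr of its items, so the bytes appear escaped.
as_tuple = repr((raw_title,))
print(as_tuple)

# Decoding turns the bytes back into text that prints readably.
as_text = raw_title.decode('utf-8')
```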

Handling multiple pages with a while loop

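    # -*- coding: utf-8 -*-
    # Python 2 script

    import re
    import urllib

    page = 1
    while page <= 7:
        # build the address of each list page from the page number
        url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
        # url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

        response = urllib.urlopen(url)
        html = response.read().decode('utf-8')  # decode the bytes as UTF-8
        # print(html)

        pattern = re.compile('(.*?)', re.S)
        blog_address = re.findall(pattern, html)

        for i in blog_address:
            print(i[0])  # post address
            print(i[1])  # post title

        page = page + 1  # move on to the next page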
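The same paging idea can be sketched in Python 3 with the URL construction and the download step factored out, so the loop can be tried without network access. `page_urls`, `fetch`, `crawl`, and `fake_fetch` are hypothetical names; the page count of 7 comes from the script above, and the link pattern assumes anchors of the form `<a title="" target="_blank" href="...">title</a>`, which the post does not confirm.

```python
import re
import urllib.request

BASE = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_{}.html'
# Assumed markup of a post link; adjust to the real page source.
POST_LINK = re.compile(
    r'<a title="" target="_blank" href="(.*?)">(.*?)</a>', re.S
)


def page_urls(last_page=7):
    """Build the list-page addresses for pages 1..last_page."""
    return [BASE.format(page) for page in range(1, last_page + 1)]


def fetch(url):
    """Download one page and decode it as UTF-8."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')


def crawl(urls, fetch_page=fetch):
    """Return (address, title) pairs from every page, in page order."""
    posts = []
    for url in urls:
        posts.extend(POST_LINK.findall(fetch_page(url)))
    return posts


def fake_fetch(url):
    """Stand-in for fetch() that returns one made-up link per page."""
    return (
        '<a title="" target="_blank" href="{0}#post">post from {0}</a>'
    ).format(url)


# Offline demonstration over the first two pages.
demo = crawl(page_urls(2), fetch_page=fake_fetch)
for address, title in demo:
    print(address)
    print(title)
```

To crawl the live pages, call `crawl(page_urls())` and let it use the default `fetch`.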

The last part of the output is shown in the figure below:
