BeautifulSoup教程（1） - 简介及安装

发表: 2017-04-16 浏览: 1951

Python 开发

最近在学习Python，按照一些博客练习爬虫，最简单的步骤，就是访问一个主页，根据正则表达式去获取我们想要的标签数据；

比如这样：

#加载网址，获取当前页面

def getHTML(url) :

	page = urllib.urlopen(url)

	html = page.read()

	return html



def getImage(html) :

	reg = r'src="(.+?\.jpg)"'

	reg2 = r'<img alt="(.+?)" src="(.+?\.jpg)'

	image_reg = re.compile(reg2)

	img_list = re.findall(image_reg,html)

简单的话，这样还好，如果复杂些的话，像我一样对正则表达式不熟悉的话，可能就不太好实现了，

后面发现这个beautifulSoup解析HTML很方便，这里简单学习下，

官网地址：https://www.crummy.com/software/BeautifulSoup/

还有中文文档：https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

1. 简介

这里说的不错，Python爬虫利器二之Beautiful Soup的用法

2. 安装

Python里面安装东西很方便，直接使用pip就行了

 pip install beautifulsoup4

3.小例子

我们先写个小例子看看

 # -*- coding: utf-8 -*-



import urllib

import re

from bs4 import BeautifulSoup



#加载网址，获取当前页面

def getHTML(url) :

	page = urllib.urlopen(url)

	html = page.read()

	return html



html = getHTML('https://movie.douban.com/top250')

soup = BeautifulSoup(html, "html.parser")





for img in soup.find_all('img'):

	print img.get('src')

这里，我们就输出了所有的img标签

后面，我们再来继续练习使用

0 个评论

要回复文章请先登录或注册