Python Web Scraping with Multiprocessing (Using the 58.com Second-Hand Market as an Example)

Today, using the 58.com second-hand market (i.e., Zhuanzhuan) as an example, I'll show you how to scrape structured data at scale.

Analysis

First, take a look at Zhuanzhuan's page structure and the data I want to scrape:

(Screenshots: category page, listing page, detail page)


My approach is to first extract the links to the major categories, then crawl each category's listing pages for item links, and finally scrape the data from each detail page. I created three Python files in total: channel_extract.py, page_spider.py, and main.py.

channel_extract.py

import requests
from lxml import etree

start_url = 'http://cs.58.com/sale.shtml'
url_host = 'http://cs.58.com'

def get_channel_urls(url):
    # Fetch the second-hand market home page and print every category link
    html = requests.get(url)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="lbsear"]/div/ul/li')

    for info in infos:
        class_urls = info.xpath('ul/li/b/a/@href')
        for class_url in class_urls:
            print(url_host + class_url)


# get_channel_urls(start_url)

channel_list = '''
http://cs.58.com/shouji/
http://cs.58.com/tongxunyw/
http://cs.58.com/danche/
http://cs.58.com/fzixingche/
http://cs.58.com/diandongche/
http://cs.58.com/sanlunche/
http://cs.58.com/peijianzhuangbei/
http://cs.58.com/diannao/
http://cs.58.com/bijiben/
http://cs.58.com/pbdn/
http://cs.58.com/diannaopeijian/
http://cs.58.com/zhoubianshebei/
http://cs.58.com/shuma/
http://cs.58.com/shumaxiangji/
http://cs.58.com/mpsanmpsi/
http://cs.58.com/youxiji/
http://cs.58.com/jiadian/
http://cs.58.com/dianshiji/
http://cs.58.com/ershoukongtiao/
http://cs.58.com/xiyiji/
http://cs.58.com/bingxiang/
http://cs.58.com/binggui/
http://cs.58.com/chuang/
http://cs.58.com/ershoujiaju/
http://cs.58.com/bangongshebei/
http://cs.58.com/diannaohaocai/
http://cs.58.com/bangongjiaju/
http://cs.58.com/ershoushebei/
http://cs.58.com/yingyou/
http://cs.58.com/yingeryongpin/
http://cs.58.com/muyingweiyang/
http://cs.58.com/muyingtongchuang/
http://cs.58.com/yunfuyongpin/
http://cs.58.com/fushi/
http://cs.58.com/nanzhuang/
http://cs.58.com/fsxiemao/
http://cs.58.com/xiangbao/
http://cs.58.com/meirong/
http://cs.58.com/yishu/
http://cs.58.com/shufahuihua/
http://cs.58.com/zhubaoshipin/
http://cs.58.com/yuqi/
http://cs.58.com/tushu/
http://cs.58.com/tushubook/
http://cs.58.com/wenti/
http://cs.58.com/yundongfushi/
http://cs.58.com/jianshenqixie/
http://cs.58.com/huju/
http://cs.58.com/qiulei/
http://cs.58.com/yueqi/
http://cs.58.com/chengren/
http://cs.58.com/nvyongpin/
http://cs.58.com/qinglvqingqu/
http://cs.58.com/qingquneiyi/
http://cs.58.com/chengren/
http://cs.58.com/xiaoyuan/
http://cs.58.com/ershouqiugou/
http://cs.58.com/tiaozao/
'''

Scraping the category links is fairly simple, so I won't go into it here; I then assigned the scraped category links to the channel_list variable (the reason for this is explained below).
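
If you would rather not hard-code the list, get_channel_urls could instead collect and return the links. Here is a minimal sketch of that variant (the name get_channel_list is my own, not part of the original code):

def get_channel_list(url):
    # Same XPath as get_channel_urls, but collect the links into a list instead of printing them
    html = requests.get(url)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="lbsear"]/div/ul/li')
    channels = []
    for info in infos:
        for class_url in info.xpath('ul/li/b/a/@href'):
            channels.append(url_host + class_url)
    return channels

# channel_list = '\n'.join(get_channel_list(start_url))  # would replace the hard-coded string above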

page_spider.py

import requests
from lxml import etree
import time
import pymongo
# import random

client = pymongo.MongoClient('localhost', 27017)
test = client['test']            # database
tongcheng = test['tongcheng']    # collection that will hold the scraped items

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Connection': 'keep-alive'
}

# proxy_list = [
# 'http://218.56.132.158',
# 'http://115.47.44.102',
# 'http://118.144.149.200',
# 'http://222.223.239.135',
# 'http://123.234.219.133'
# ]
# proxy_ip = random.choice(proxy_list)
# proxies = {'http':proxy_ip}

def get_links_from(channel, pages):
    # Build the listing-page URL, e.g. http://cs.58.com/shouji/pn2/
    list_view = '{}pn{}/'.format(channel, str(pages))
    try:
        html = requests.get(list_view, headers=headers)
        time.sleep(2)
        selector = etree.HTML(html.text)
        if selector.xpath('//tr'):
            infos = selector.xpath('//tr')
            for info in infos:
                if info.xpath('td[2]/a/@href'):
                    url = info.xpath('td[2]/a/@href')[0]
                    get_info(url)
                else:
                    pass
        else:
            pass
    except requests.exceptions.ConnectionError:
        pass

def get_info(url):
    # Scrape one detail page and save the fields to MongoDB
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    try:
        title = selector.xpath('//h1/text()')[0]
        if selector.xpath('//span[@class="price_now"]/i/text()'):
            price = selector.xpath('//span[@class="price_now"]/i/text()')[0]
        else:
            price = "无"
        if selector.xpath('//div[@class="palce_li"]/span/i/text()'):
            area = selector.xpath('//div[@class="palce_li"]/span/i/text()')[0]
        else:
            area = "无"
        view = selector.xpath('//p/span[1]/text()')[0]
        if selector.xpath('//p/span[2]/text()'):
            want = selector.xpath('//p/span[2]/text()')[0]
        else:
            want = "无"
        info = {
            'title': title,
            'price': price,
            'area': area,
            'view': view,
            'want': want,
            'url': url
        }
        tongcheng.insert_one(info)

    except IndexError:
        pass

1. Use try/except liberally so that a single error doesn't kill the whole crawl; I added these after several rounds of debugging.
2. For large-scale scraping, add request headers, a proxy pool, request delays, and resumable crawling. Headers need little explanation: always add them so that Python impersonates a browser. When you scrape a lot of data over a long period, your IP is very likely to get banned; the best countermeasure is a proxy pool, and the commented-out code above shows how to set one up (the free IPs I found didn't work, so I eventually gave up on it). Request delays simply reduce how often you hit the site. Resumable crawling means that if the crawl stops halfway for whatever reason, it should pick up where it left off rather than start over (I'll cover that properly another time; a rough sketch follows this list).
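
As an illustration of the resume idea only (my own sketch, not the author's implementation): since every saved record already contains its url, a restarted crawl can skip URLs that are already in MongoDB before calling get_info. The helper name get_info_if_new is hypothetical.

def get_info_if_new(url):
    # Skip detail pages whose URL is already stored in the tongcheng collection
    if tongcheng.find_one({'url': url}) is None:
        get_info(url)

# Calling get_info_if_new(url) instead of get_info(url) inside get_links_from would make the
# crawl resumable; an index on 'url' (tongcheng.create_index('url')) keeps the lookup fast.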

main.py

import sys
sys.path.append("..")
from multiprocessing import Pool
from channel_extract import channel_list
from page_spider import get_links_from

def get_all_links_from(channel):
    # Crawl listing pages 1-100 for a single category
    for num in range(1, 101):
        get_links_from(channel, num)

if __name__ == '__main__':

    pool = Pool()
    pool.map(get_all_links_from, channel_list.split())

And that's the multiprocessing part! It's simple to use: Pool.map splits channel_list across the worker processes, and each process runs get_all_links_from on its share of the categories.
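
One detail worth spelling out: Pool() with no arguments starts one worker process per CPU core. If that hammers the site (or your machine) too hard, you can cap it explicitly; the value 4 below is just an example, not something from the original script:

if __name__ == '__main__':

    pool = Pool(processes=4)                            # e.g. cap the pool at 4 worker processes
    pool.map(get_all_links_from, channel_list.split())
    pool.close()                                        # no more tasks will be submitted
    pool.join()                                         # wait for all workers to finish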

Results

To make it easy to monitor how the crawl is going, I also created a counts.py:

import sys
sys.path.append("..")
import time
from page_spider import tongcheng

while True:
    print(tongcheng.find().count())
    time.sleep(5)

Run it from the command line; it has scraped 100,000+ records so far and is still going...
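
A side note if you run this on a newer pymongo (3.7+): Cursor.count() is deprecated there, and count_documents on the collection is the replacement, so the loop would become:

while True:
    print(tongcheng.count_documents({}))   # total number of documents in the collection
    time.sleep(5)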


This article was written by 罗罗攀 and is licensed under the Creative Commons Attribution-ShareAlike 3.0 China Mainland license. Please contact the author before reposting or quoting, credit the author, and cite the original source.
