Python Crawler: Scraping Jianshu's Seven-Day Trending Data (Asynchronous Loading Explained)

Published: 2017-10-12

Python

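I've been working from home lately and haven't posted for several days. My friend 程兄 has recently started writing on Jianshu too; do go read his posts, they are very detailed (I'm lazy, after all). He happened to ask me about asynchronous loading, so today I'll use Jianshu's seven-day trending list as an example to show how to scrape data that is loaded asynchronously.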

Analyzing the asynchronous loading

1 First, take a look at the page:

Nothing looks special at first, but scrolling down gives this:

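No need to think twice: this is asynchronous loading. On other sites you move to the next page (or any other page) with a button, so the URL of every page is plainly visible. Here we have to find the request URL ourselves, either with the developer tools built into Chrome or by capturing traffic with Fiddler. Press F12 to open the tools, press F5 to refresh, scroll down and click "load more", then watch which requests appear: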

Once we have found it, we can construct the URLs!
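As a minimal sketch of that step, the list-page URLs can be generated from the pattern seen in the captured request. The URL pattern is the one used in the code later in this post; the helper name `build_page_urls` and the page range 1 to 3 are only illustrative assumptions.

```python
# URL pattern observed in the network panel when "load more" is clicked.
BASE_URL = 'http://www.jianshu.com/trending/weekly?page={}'


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


# Illustrative range only; the real number of pages has to be checked by hand.
page_urls = build_page_urls(1, 3)
for page_url in page_urls:
    print(page_url)
```

Keeping the pattern in one constant makes it easy to adjust if the site changes the query parameter.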

2 Inspecting the detail page

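I scraped the data from the detail pages, but the read count, comments, likes, rewards, and included collections are all loaded asynchronously (a pit I dug for myself~)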

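Readers new to this may ask how you know that these fields are loaded asynchronously. Usually we don't know in advance: when the Python crawler comes back without the data, we look at the page source, and if the data we want is not in the source, it is loaded asynchronously.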
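That check can be sketched as a simple substring test on the downloaded HTML. The helper name `is_in_source` and the sample HTML below are made up for illustration; in practice `html_text` would be the `.text` of a `requests` response.

```python
def is_in_source(html_text, expected_text):
    """Return True if the text we see in the browser is present in the raw HTML."""
    return expected_text in html_text


# Hypothetical page source: the title is in the HTML, the comment count is not.
sample_html = (
    '<html><body><h1 class="title">Demo</h1>'
    '<div id="comments"></div></body></html>'
)

print(is_in_source(sample_html, 'Demo'))         # True: can be parsed directly
print(is_in_source(sample_html, '42 comments'))  # False: likely loaded asynchronously
```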
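The read, comment, and like counts are the easy ones: they sit inside a script tag in the source (just in a different place), so we can extract them with a regular expression: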
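A small sketch of this kind of extraction, run against a made-up script tag. The field names `views_count`, `likes_count`, and `comments_count` are the ones used in the code later in this post; the sample HTML and the helper `extract_count` are assumptions, and the digit-only pattern is a stricter variant of the `(.*?)` patterns in the full script.

```python
import re

# Hypothetical fragment of a detail page; the real script tag holds more fields.
sample_html = (
    '<script type="application/json">'
    '{"note":{"id":123456,"views_count":1773,"likes_count":25,"comments_count":8}}'
    '</script>'
)


def extract_count(html_text, field):
    """Return the integer value of a JSON-style "field":123 pair, or None if absent."""
    pattern = r'"{}":\s*(\d+)'.format(re.escape(field))
    match = re.search(pattern, html_text)
    return int(match.group(1)) if match else None


for field in ('views_count', 'comments_count', 'likes_count'):
    print(field, extract_count(sample_html, field))
```

Returning `None` instead of indexing `[0]` on an empty list avoids an `IndexError` when a field is missing from a page.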

Rewards and included collections, however, require finding the right request in the captured traffic.
Rewards first:

We can see that this URL contains a number, and the same number appears in the page source, so we use it to construct the request URL.
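A rough sketch of that construction, using a made-up page fragment. The rewards URL pattern and the `rewards_count` key are the ones used in the code later in this post; the sample strings and the helper `build_rewards_url` are assumptions for illustration.

```python
import json
import re

# URL pattern of the rewards request seen in the captured traffic.
REWARDS_URL = 'http://www.jianshu.com/notes/{}/rewards?count=20'


def build_rewards_url(html_text):
    """Find the numeric note id in the page source and build the rewards URL."""
    match = re.search(r'\{"id":\s*(\d+)', html_text)
    if match is None:
        return None
    return REWARDS_URL.format(match.group(1))


# Hypothetical page source and hypothetical JSON reply from the rewards endpoint.
sample_html = '<script>{"id":123456,"slug":"abc"}</script>'
sample_reply = '{"rewards_count": 3, "rewards": []}'

rewards_url = build_rewards_url(sample_html)
rewards_count = json.loads(sample_reply)['rewards_count']
print(rewards_url)
print(rewards_count)
```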
Now the included collections:
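The same note id can be reused for the included-collections request, which is paged. The sketch below assumes the response shape implied by the code later in this post (a `collections` list whose items carry a `title`); the sample response and the helper names are made up.

```python
import json

# URL pattern of the included-collections request; it takes a note id and a page.
COLLECTIONS_URL = 'http://www.jianshu.com/notes/{}/included_collections?page={}'


def build_collection_urls(note_id, pages):
    """Return the request URLs for pages 1..pages of a note's collections."""
    return [COLLECTIONS_URL.format(note_id, page) for page in range(1, pages + 1)]


def collection_titles(response_text):
    """Parse one JSON response and return the collection titles it contains."""
    data = json.loads(response_text)
    return [item['title'] for item in data.get('collections', [])]


# Hypothetical response body; a real one contains more keys per collection.
sample_response = '{"collections": [{"title": "Python"}, {"title": "Crawlers"}]}'

print(build_collection_urls(123456, 2))
print(collection_titles(sample_response))
```

Using `data.get('collections', [])` keeps the loop safe on pages past the last one, assuming those return an object without the key or with an empty list.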

Code

from lxml import etree
import requests
import pymongo
import re
import json
import time



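# Connect to a local MongoDB instance; results go into the 'test' database.
client = pymongo.MongoClient('localhost', 27017)
test = client['test']
sevenday = test['sevenday']  # one document per article
embody = test['embody']      # titles of the collections that include the articles

# Send a browser User-Agent with every request.
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

# List-page URLs, built from the request seen when clicking "load more".
urls = ['http://www.jianshu.com/trending/weekly?page={}'.format(str(i)) for i in range(0,11)]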



def get_url(url):
    # Fetch one list page and follow the link of every article on it.
    html = requests.get(url,headers=header)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//ul[@class="note-list"]/li')

    for info in infos:
        article_url_part = info.xpath('div/a/@href')[0]
        get_info(article_url_part)





def get_info(url):
    # Strip any leading slash so the joined URL has no double slash.
    article_url = 'http://www.jianshu.com/' + url.lstrip('/')
    html = requests.get(article_url,headers=header)
    selector = etree.HTML(html.text)

    # Fields present in the HTML itself.
    author = selector.xpath('//span[@class="name"]/a/text()')[0]
    article = selector.xpath('//h1[@class="title"]/text()')[0]
    date = selector.xpath('//span[@class="publish-time"]/text()')[0]
    word = selector.xpath('//span[@class="wordage"]/text()')[0]

    # Counts embedded in a script tag, extracted with regular expressions.
    view = re.findall('"views_count":(.*?),',html.text,re.S)[0]
    comment = re.findall('"comments_count":(.*?)}',html.text,re.S)[0]
    like = re.findall('"likes_count":(.*?),',html.text,re.S)[0]

    # The numeric note id is needed to build the asynchronous request URLs.
    # (Named note_id to avoid shadowing the built-in id().)
    note_id = re.findall('{"id":(.*?),',html.text,re.S)[0]

    # Rewards come from a separate JSON request.
    gain_url = 'http://www.jianshu.com/notes/{}/rewards?count=20'.format(note_id)
    wb_data = requests.get(gain_url,headers=header)
    json_data = json.loads(wb_data.text)
    gain = json_data['rewards_count']

    info = {
        'author':author,
        'article':article,
        'date':date,
        'word':word,
        'view':view,
        'comment':comment,
        'like':like,
        'gain':gain
    }
    sevenday.insert_one(info)

    # Included collections are paged JSON; fetch the first three pages.
    include_urls = ['http://www.jianshu.com/notes/{}/included_collections?page={}'.format(note_id,str(i)) for i in range(1,4)]
    for include_url in include_urls:
        include_data = requests.get(include_url,headers=header)
        json_data2 = json.loads(include_data.text)
        includes = json_data2['collections']
        for include in includes:
            include_title = include['title']
            embody.insert_one({'include_title':include_title})

    # Pause between articles to avoid hammering the site.
    time.sleep(2)



for url in urls:
    get_url(url)

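Summary

1 If the data is not in the page source, it is loaded asynchronously.
2 When hunting for the right request, the English names give it away (middle-school vocabulary is enough).
I couldn't sleep last night and read an article by 向右奔跑 in the small hours; in a little while I'll use it as an example to explain how to write this with scrapy.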
