Using Selenium to Scrape Asynchronously Loaded Content

Published: 2017-10-12 Views: 1293

Python

Asynchronous loading in Jianshu articles

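The first assignment of our crawler study group was to scrape Jianshu's seven-day trending list, so group members will remember that part of the data on an article page is loaded asynchronously. Back then, the view, comment, and like counts were extracted by matching regular expressions against the page source, while the collections that include an article were obtained by tracking down the network request that returns them.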
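For comparison, the regular-expression strategy described above can be sketched roughly as follows. The post does not show the real page source, so the `sample_html` string and the key names `views_count`, `comments_count`, and `likes_count` are assumptions made up for illustration, and `extract_counts` is a hypothetical helper.

```python
import re

# Invented stand-in for the raw page source; the real markup is not shown in this post.
sample_html = (
    '<script type="application/json">'
    '{"views_count":350,"comments_count":7,"likes_count":42}'
    '</script>'
)


def extract_counts(html):
    """Pull the three counters out of raw page source with regular expressions."""
    counts = {}
    for key in ("views_count", "comments_count", "likes_count"):
        # Look for "key":<digits> anywhere in the source.
        match = re.search(r'"%s":(\d+)' % key, html)
        counts[key] = int(match.group(1)) if match else None
    return counts


print(extract_counts(sample_html))
```

This only works when the numbers are embedded in the initial HTML response. Anything fetched later by JavaScript, such as the collection list, never appears in that source, which is why it needed a separate request.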

Selenium code

from selenium import webdriver

url = 'http://www.jianshu.com/p/c9bae3e9e252'

def get_info(url):
    # Written for Selenium 3: PhantomJS and the find_element_by_* helpers
    # are no longer available in current Selenium 4 releases.
    include_title = []
    driver = webdriver.PhantomJS()
    # Let element lookups poll for up to 20 seconds so asynchronously loaded nodes can appear.
    driver.implicitly_wait(20)
    driver.get(url)
    try:
        author = driver.find_element_by_xpath('//span[@class="name"]/a').text
        date = driver.find_element_by_xpath('//span[@class="publish-time"]').text
        word = driver.find_element_by_xpath('//span[@class="wordage"]').text
        view = driver.find_element_by_xpath('//span[@class="views-count"]').text
        comment = driver.find_element_by_xpath('//span[@class="comments-count"]').text
        like = driver.find_element_by_xpath('//span[@class="likes-count"]').text
        # Names of the collections that include this article (loaded asynchronously).
        included_names = driver.find_elements_by_xpath('//div[@class="include-collection"]/a/div')
        for i in included_names:
            include_title.append(i.text)
        print(author, date, word, view, comment, like, include_title)
    finally:
        # Shut down the headless browser process.
        driver.quit()

get_info(url)

Since this handles only a single page, the results are simply printed rather than stored in a database.
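On a current Selenium 4 installation, the same extraction might look like the sketch below. It is an adaptation, not the original author's code: headless Chrome is swapped in for PhantomJS, the names `FIELD_XPATHS`, `COLLECTIONS_XPATH`, and `extract_info` are introduced here for illustration, and the XPath expressions are copied from the script above on the assumption that the page markup has not changed since 2017.

```python
# XPath for each single-valued field, copied from the original script.
FIELD_XPATHS = {
    "author": '//span[@class="name"]/a',
    "date": '//span[@class="publish-time"]',
    "word": '//span[@class="wordage"]',
    "view": '//span[@class="views-count"]',
    "comment": '//span[@class="comments-count"]',
    "like": '//span[@class="likes-count"]',
}
COLLECTIONS_XPATH = '//div[@class="include-collection"]/a/div'


def extract_info(driver):
    """Read all fields from an already loaded page via find_element(by, value)."""
    # "xpath" is the string value of Selenium's By.XPATH constant.
    info = {
        name: driver.find_element("xpath", xpath).text
        for name, xpath in FIELD_XPATHS.items()
    }
    info["include_title"] = [
        element.text for element in driver.find_elements("xpath", COLLECTIONS_XPATH)
    ]
    return info


def get_info(url):
    """Open the URL in headless Chrome and return the extracted fields."""
    # Imported here so extract_info can be exercised without a browser installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        # Element lookups poll for up to 20 seconds for late-arriving nodes.
        driver.implicitly_wait(20)
        driver.get(url)
        return extract_info(driver)
    finally:
        driver.quit()
```

Calling `get_info('http://www.jianshu.com/p/c9bae3e9e252')` would return a dictionary instead of printing, which makes it easier to write the result to a database later.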

Code analysis

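Because Selenium drives a browser that executes JavaScript, the asynchronously loaded data is already present in the page it sees. The XPath you get by inspecting an element in Chrome can therefore be used directly to extract the information. Take the collections as an example: inspect the element, build the XPath from what you see, and there is no need to hunt for the underlying request.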
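To see what the collections XPath selects, here is a small sketch that runs the same expression against a hand-written fragment standing in for the rendered DOM. The fragment and the collection names in it are invented for illustration. It uses the standard library's ElementTree, which supports only a subset of XPath, so the path starts with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

# Invented fragment standing in for the rendered DOM; the collection names are made up.
rendered = """
<body>
  <div class="include-collection">
    <a href="#"><div>Python</div></a>
    <a href="#"><div>Web scraping</div></a>
  </div>
</body>
""".strip()

root = ET.fromstring(rendered)
# Same structure as the Selenium XPath: div.include-collection -> a -> div.
names = [
    div.text
    for div in root.findall(".//div[@class='include-collection']/a/div")
]
print(names)
```

In the real page these nodes exist only after the JavaScript has run, which is why the expression works in Selenium but finds nothing in the raw HTML response.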
