Reading Notes on "Web Scraping with Python" (《Python网络数据采集》): Chapters 4-6, Using APIs, Storing Data, and Reading Documents

Published: 2016-04-18 Views: 4591

Python

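Chpt04 through Chpt06 of "Web Scraping with Python" mostly introduce specific features: accessing data through APIs, storing data, and parsing various kinds of documents. I will just list the key points here; for the details, see the sample code.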

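Chpt04. Using APIs to Access Data

This chapter mainly covers the Twitter and Google APIs, neither of which we can use here. There are still some good APIs available in China, though, such as the AutoNavi (高德地图) map API for looking up address information (I will cover that when I have time).

Sample code: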

1-searchTwitter.py

2-updateTwitter.py

3-getTwitterStatus.py

4-decodeJson.py

5-jsonParsing.py

6-wikiHistories.py

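What I found most useful in this chapter is an API that returns information about an IP address:

https://freegeoip.net/json/ipaddresss Unfortunately, it no longer seems to be reachable.

I found a replacement, although it returns less information than freegeoip.net: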

http://geoip.nekudo.com/api/50.78.253.58

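Here is an example:

import json
from urllib.request import urlopen

def getCountry(ipAddress):
    # Query the GeoIP service and decode the response bytes as UTF-8
    response = urlopen("http://geoip.nekudo.com/api/"+ipAddress).read().decode('utf-8')
    # Show the raw JSON text before parsing it
    print("src:", response)
    responseJson = json.loads(response)
    # "country" is a nested object with "name" and "code" keys
    return responseJson.get("country")

print("country:", getCountry("50.78.253.58"))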

src: {"city":"Milford","country":{"name":"United States","code":"US"},"location":{"latitude":41.2241,"longitude":-73.0517,"time_zone":"America\/New_York"},"ip":"50.78.253.58"}

country: {'code': 'US', 'name': 'United States'}
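
Because `responseJson.get("country")` returns a nested dictionary rather than a string, pulling out a single field takes one more lookup. The sketch below runs offline against the sample response shown above. The helper name `country_code` is hypothetical, and it assumes the payload keeps the shape of that sample.

```python
import json

# Sample payload copied from the geoip.nekudo.com response shown above.
SAMPLE = ('{"city":"Milford","country":{"name":"United States","code":"US"},'
          '"location":{"latitude":41.2241,"longitude":-73.0517,'
          '"time_zone":"America\\/New_York"},"ip":"50.78.253.58"}')


def country_code(raw_json):
    """Return the two-letter country code, or None if the field is missing."""
    data = json.loads(raw_json)
    # "country" is itself a dict, so use .get() twice to avoid a KeyError.
    country = data.get("country") or {}
    return country.get("code")


print(country_code(SAMPLE))
# JSON allows "\/" as an escaped slash; json.loads turns it back into "/".
print(json.loads(SAMPLE)["location"]["time_zone"])
```

Returning `None` for a missing field keeps the caller from crashing if the service changes or omits part of its response.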

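Chpt05. Storing Data

The earlier chapters covered how to fetch and parse web page data. Once the data has been read, it needs to be stored, and that is what this chapter covers:

Saving media such as images, writing data to CSV files, using a MySQL database, and sending email (the email example needs a small modification).

Sample code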

1-getPageMedia.py

2-createCsv.py

3-scrapeCsv.py

4-mysqlBasicExample.py

5-storeWikiLinks.py

6-6DegreesCrawlWiki.py

7-sendEmail.py

8-sendEmailWhenChristmas.py

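Some useful code

I only modified the email-sending code to add the SMTP configuration; everything else is the same as the sample code.

# Save an image to local disk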

from urllib.request import urlretrieve

urlretrieve('http://www.pythonscraping.com/sites/default/files/lrg_0.jpg','lrg_0.jpg')
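
The chapter's CSV-writing step is not reproduced in these notes, so here is a minimal sketch using the standard `csv` module. The file name `test.csv` and the rows are made up for illustration and are not taken from the book's `2-createCsv.py`.

```python
import csv
import os
import tempfile

# Hypothetical output path; a temp directory keeps the example self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["number", "number plus 2", "number times 2"])
    for i in range(3):
        writer.writerow([i, i + 2, i * 2])

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))
print(rows)
```

Note that everything read back from a CSV file is a string, so numeric columns need an explicit `int()` or `float()` conversion.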

 

# MySQL connection and query

import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/tmp/mysql.sock',

                       user='root', passwd=None, db='mysql')

cur = conn.cursor()

cur.execute("USE scraping")

cur.execute("SELECT * FROM pages WHERE id=1")

print(cur.fetchone())

cur.close()

conn.close()
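
When the `id` comes from scraped data rather than a literal, it is safer to pass it as a query parameter than to build the SQL string by hand. The sketch below shows the same connect / cursor / execute / fetch pattern with a parameterized query. It swaps in the standard-library `sqlite3` module with an in-memory database so that it runs without a MySQL server; the `pages` columns and the inserted row are assumptions for illustration. With `pymysql` the placeholder is `%s` instead of `?`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL "scraping" database.
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()
    # Hypothetical schema; the real "pages" table may differ.
    cur.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
    cur.execute("INSERT INTO pages (title, content) VALUES (?, ?)",
                ("Test page title", "Test page content"))
    conn.commit()

    # The value is passed separately, so the driver handles the quoting.
    page_id = 1
    cur.execute("SELECT * FROM pages WHERE id = ?", (page_id,))
    row = cur.fetchone()
    print(row)
    cur.close()
finally:
    # Close the connection even if one of the statements above fails.
    conn.close()
```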



# Email: the original code, slightly modified to add the SMTP server settings

import smtplib

from email.mime.text import MIMEText

msg = MIMEText("The body of the email is here")

msg['Subject'] = "An Email Alert"

msg['From'] = "from@163.com"

msg['To'] = "to@163.com"

s = smtplib.SMTP('smtp.163.com',25)

s.login(msg['From'] , 'password')

s.send_message(msg)

s.quit()
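
One way to keep the SMTP settings out of the message-building code is to wrap the two steps in functions. This is a sketch, not the book's code: `build_alert` and `send_alert` are hypothetical names, and the host, port, and credentials must come from your own mail provider. `send_alert` is defined but not called here, since it needs a real account.

```python
import smtplib
from email.mime.text import MIMEText


def build_alert(sender, recipient, subject, body):
    """Build a plain-text message with the usual headers set."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    return msg


def send_alert(msg, host, port, password):
    """Log in as the sender and send msg; the connection closes on exit."""
    with smtplib.SMTP(host, port) as server:
        server.login(msg["From"], password)
        server.send_message(msg)


msg = build_alert("from@163.com", "to@163.com",
                  "An Email Alert", "The body of the email is here")
print(msg["Subject"], msg["To"])
# send_alert(msg, "smtp.163.com", 25, "password")  # needs real credentials
```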

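Chpt06. Reading Documents

This chapter covers how to read various data sources (plain text, PDF, Word), including how to deal with character-encoding problems. Python's support for Word documents is currently weak, though, and has significant limitations.

Sample code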

1-getText.py

2-getUtf8Text.py

3-readingCsv.py

4-readingCsvDict.py

5-readPdf.py

6-readDocx.py

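Useful scripts

# Character set conversion (str -> bytes -> str); bsObj is assumed to be a BeautifulSoup object for an already-fetched page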

content = bsObj.find("div", {"id":"mw-content-text"}).get_text()

content = bytes(content, "UTF-8")

content = content.decode("UTF-8")
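
The round trip above only works because both steps use the same encoding. Here is a small offline sketch of what happens when they do not match; the sample string is made up.

```python
text = "café"

# Encoding turns str into bytes; "é" takes two bytes in UTF-8.
raw = text.encode("utf-8")
print(raw)

# Decoding with the same codec restores the original string.
print(raw.decode("utf-8"))

# Decoding with the wrong codec raises no error but produces mojibake.
print(raw.decode("latin-1"))

# errors="ignore" drops the bytes that the codec cannot decode.
print(raw.decode("ascii", "ignore"))
```

The last call is the same `decode('ascii', 'ignore')` used in the CSV snippet: it never fails, but any non-ASCII characters are silently lost.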

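# Read a CSV document
import csv
from io import StringIO
from urllib.request import urlopen

# Decode as ASCII, dropping any bytes that are not valid ASCII
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
# Wrap the text in a file-like object so csv can read it without saving to disk
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)

for row in dictReader:
    print(row)
    print(row['Year'])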
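
To see what `DictReader` yields without a network connection, the same pattern can be run on an inline string. The two rows below are a hand-written sample, not the contents of the downloaded file, and the `Name` column is an assumption; only `Year` is known from the snippet above.

```python
import csv
from io import StringIO

# Hand-written sample data; the first line becomes the dictionary keys.
data = "Name,Year\nFirst Album,1970\nSecond Album,1971\n"

# StringIO wraps the string in a file-like object that csv can iterate over.
dict_reader = csv.DictReader(StringIO(data))
print(dict_reader.fieldnames)

# Each row is a dict keyed by column name; values are always strings.
years = [row["Year"] for row in dict_reader]
print(years)
```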

Attachment:

ScrapWithPythonPart1_Chpt04_6.zip
