Reading Notes on Web Scraping with Python (《Python网络数据采集》) -- Chapter 7: Cleaning Dirty Data

Published: 2016-04-26

Python, web crawler

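This post covers Chapter 7, "Cleaning Your Dirty Data", of the book Web Scraping with Python (《Python网络数据采集》).

These chapters mostly introduce specific features, so I will only list the key points; see the examples for the details.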

Chpt07.Cleaning Your Dirty Data

This chapter mainly covers simple 2-grams (splitting text into pairs of consecutive words) and basic cleanup work.
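To make the idea of 2-grams concrete, here is a minimal sketch. The function name `ngrams` and the whitespace-based splitting are my own assumptions for illustration, not necessarily how the book's `1-2grams.py` does it.

```python
def ngrams(text, n=2):
    """Return every run of n consecutive whitespace-separated words in text.

    Hypothetical helper for illustration; splitting on whitespace is an assumption.
    """
    words = text.split()
    # Slide a window of size n across the word list.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


print(ngrams("the quick brown fox"))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
```

A text with fewer than n words simply yields an empty list.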

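It also mentions OpenRefine, but that part is not explained very clearly, so I added some examples of my own. After spending a little time with OpenRefine, I find it quite distinctive as a data quality tool.

However, much of its Reconcile feature is based on Freebase, and Google has announced that it is dropping Freebase in favor of Wikidata. How the tool will adapt is not yet clear.

I wrote a separate post that introduces OpenRefine in detail.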

1-2grams.py

2-clean2grams.py

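I will not go into 2-grams here. The next chapter covers natural language analysis, so see that one for the details.

For now, just look at these cleaning steps: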

import re

import string

input = re.sub(r'\n+', " ", input) # replace every run of newline characters with a single space

input = re.sub(r'\[[0-9]*\]', "", input) # remove citation markers such as [11]

input = re.sub(r' +', " ", input) # replace every run of multiple spaces with a single space

input = bytes(input, "UTF-8") # encode the content as UTF-8 bytes

input = input.decode("ascii", "ignore") # decode as ASCII and ignore errors, which drops non-ASCII characters

item = item.strip(string.punctuation) # strip punctuation from both ends of a word; print(string.punctuation) shows !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
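Put together, the steps above can be sketched as one function. The name `clean_text`, the split on single spaces, and the choice to drop empty strings are my own assumptions for illustration; the book's `2-clean2grams.py` may differ in its details.

```python
import re
import string


def clean_text(text):
    """Apply the cleaning steps listed above and return a list of words.

    Hypothetical helper; the filtering of empty strings is an assumption.
    """
    text = re.sub(r"\n+", " ", text)        # newlines -> single space
    text = re.sub(r"\[[0-9]*\]", "", text)  # drop citation markers such as [11]
    text = re.sub(r" +", " ", text)         # collapse runs of spaces
    # Encode to UTF-8 bytes, then decode as ASCII while ignoring errors,
    # which silently drops every non-ASCII character.
    text = text.encode("utf-8").decode("ascii", "ignore")

    words = []
    for item in text.split(" "):
        # strip() only trims punctuation at the two ends, so "don't" stays intact.
        item = item.strip(string.punctuation)
        if item:  # skip strings that became empty after stripping
            words.append(item)
    return words


print(clean_text("Python[11] is\n\ngreat,  really!"))
# ['Python', 'is', 'great', 'really']
```

Note that the ASCII step is lossy: a word such as "café" comes out as "caf", so it is probably only appropriate for English-only text.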

Code:
