Python Learning Series: A Text Word-Frequency Counter

Published: 2017-04-19 Views: 1929

Python

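Background

Back in college, when I was preparing for the CET-4 and CET-6 exams, the required vocabulary size was a real pain point. But an exam is only one paper, so memorizing the commonly used words, that is, the high-frequency vocabulary, is a good strategy. This post walks through how to count high-frequency words and the principles behind it.

Code implementation

Prepare a text file with some content; the material used here is English text.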

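import string

path = '/Users/zengjiamin/Desktop/World.txt'

with open(path, 'r', encoding='utf-8') as text:
    # Split on whitespace, strip surrounding punctuation, lowercase each token.
    stripped = (token.strip(string.punctuation).lower() for token in text.read().split())
    # Drop tokens that were pure punctuation and are now empty.
    words = [word for word in stripped if word]

# Count in a single pass; calling words.count() for every word is quadratic.
counts_dict = {}
for word in words:
    counts_dict[word] = counts_dict.get(word, 0) + 1

# items() yields the same (word, count) pairs as zip(keys(), values()).
for word, count in counts_dict.items():
    print('{} - {} times'.format(word, count))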





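'''

   Key points:

   import string: imports the standard-library string module, whose string.punctuation constant is used here to strip punctuation from each token.

   encoding='utf-8': when no encoding is given, open() uses the platform's locale encoding, which is typically GBK on Chinese-language Windows. If the file is saved as UTF-8, set the encoding explicitly; otherwise reading it may raise a UnicodeDecodeError.

   strip(): with no argument, removes whitespace from both ends of a string; with an argument, removes any of the given characters from both ends instead.

   string.punctuation: a string containing the ASCII punctuation characters; print it to see them.

   split(): splits a string; with no argument it splits on any run of whitespace (spaces, tabs, newlines).

   zip(): pairs up the elements of two or more iterables; zip(d.keys(), d.values()) yields the same pairs as d.items().

   lower(): string comparison in Python is case-sensitive, so every word is lowercased to keep "The" and "the" from being counted separately.

'''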
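
To see what each step does without needing a file on disk, here is a minimal sketch that runs the same split / strip / lower pipeline on a short in-memory string. The sample sentence and the helper names tokenize and count_words are made up for illustration and are not part of the original script.

```python
import string


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase, drop empties."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    return [word for word in stripped if word]


def count_words(words):
    """Count occurrences in one pass, keeping first-seen order."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample text, used only for this demonstration.
sample = "Hello, world! Hello again - the world says: hello."
words = tokenize(sample)
counts = count_words(words)

for word, count in counts.items():
    print('{} - {} times'.format(word, count))
```

Note that strip() only removes punctuation at the two ends of a token, so a word such as "don't" keeps its apostrophe, while a lone "-" becomes an empty string and is filtered out.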

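Output (partial screenshot)

Additional content

Use the Counter object from collections to count word frequencies in an English article and find its most frequent words (the top 5 here).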

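from collections import Counter
import re

path = '/Users/zengjiamin/Desktop/test.txt'

# Assumes the file is UTF-8, like the first example; the with block closes it.
with open(path, 'r', encoding='utf-8') as f:
    txt = f.read()

# Split on runs of characters that are not letters, digits, or underscores.
# Lowercase first so "The" and "the" are counted together, and drop the empty
# strings that re.split produces when the text starts or ends with such a run.
C = Counter(word for word in re.split(r'\W+', txt.lower()) if word)

print(C.most_common(5))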



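'''

Key points:

    collections.Counter: pass an iterable to the Counter constructor to get a dict subclass that maps each element to its count.

    Counter.most_common(n): returns a list of the n most frequent elements as (element, count) tuples, most frequent first.

    re: the standard-library regular expression module.

    \W+: lowercase \w matches a letter, digit, or underscore; uppercase \W matches any character that is not one of those. The + means one or more in a row.

'''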
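
The Counter approach can be tried the same way on an in-memory string. The sketch below uses a made-up sample sentence; it also shows re.findall(r'\w+', ...), which is a swapped-in alternative to re.split rather than what the script above uses, and which never yields empty strings.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for this demonstration.
sample = "The cat and the dog. The dog barks; the dog runs!"

# re.split on \W+ leaves an empty string when the text ends with punctuation.
pieces = re.split(r'\W+', sample.lower())

# re.findall on \w+ returns only the word-like runs, so no filtering is needed.
words = re.findall(r'\w+', sample.lower())

counter = Counter(words)
top_two = counter.most_common(2)
print(top_two)
```

Either splitting and filtering out empty strings or matching words directly with findall gives the same counts; findall simply saves the filtering step.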
