NLP之nltk学习笔记2

浏览: 1940

请先加载


##词链表
sent1=['call','me','David','.']
sent1
len(sent1)
print(sent2)
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

In [28]:

print(sent3)
['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

In [45]:

#连接
a=sent1+sent4
print(a)
text4.index('awaken')
['call', 'me', 'David', '.', 'Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':']

Out[45]:

173

In [29]:

fdist1 = FreqDist(text1)#计算词频

In [30]:

fdist1

Out[30]:

FreqDist({'perusal': 1,
'brimmed': 2,
'puffs': 5,
'smile': 5,
'rather': 68,
'likened': 4,
'dispel': 1,
'Praetorians': 1,
。。。。。。。。。。
'prejudices': 3,
'lower': 57,
'foster': 1,
'cranes': 3,
'thine': 17,
'trodden': 1,
'whereof': 4,
'Fifth': 2,
'robes': 5,
...})

In [34]:

vocabulary1 =list(fdist1.keys())#list在3版本不能省略,因为返回的是iteratable
vocabulary1[:50]#查看前50词频

Out[34]:

['wards',
'Pythagoras',
'snowy',
'perhaps',
'puffs',
'vanquished',
'135',
'smile',
'half',
'articles',
'jackets',
'fickleness',
'rather',
'sufficiently',
'Republican',
'bunting',
'sued',
'spacious',
'dispel',
'digger',
'secret',
'uncouth',
'justification',
'staunch',
'perusal',
'stones',
'footfall',
'Prodigies',
'ticklish',
'patris',
'indifferently',
'insinuates',
'hastily',
'Hurriedly',
'linen',
'deviations',
'Desolation',
'ruling',
'shipmate',
'reminds',
'revolved',
'engrafted',
'forthing',
'shambling',
'unshunned',
'incompetency',
'Gospel',
'provincialisms',
'these',
'early']

In [35]:

fdist1['whale']#出现次数

Out[35]:

906

In [36]:

fdist1.plot(50, cumulative=True)#前五十的累计分布图

In [37]:

fdist1.hapaxes()#返回只出现一次的词

Out[37]:

['Pythagoras',
'vanquished',
'135',
'fickleness',
'Republican',
'bunting',
'sued',
'dispel',
'
'predecessors',
'chimed',
'slopingly',
'properties',

'21',
'quoggy',
'retrace',
'Silent',
'unbegun',
'intestines',
'exploded',
'shelves',
'impudence',
'astronomical',
'BLACKSMITH',
'loosening',
'Isabella',
'patron',
'functionary',
。。。。。。。。。。。。。。。。
'disguisement',
'lieutenant',
'Way',
...]

In [43]:

#细粒度的选择词
V = set(text4)
long_words = [w for w in V if len(w) > 15]#长度大于15的单词
sorted(long_words)
#V = set(text5)
#long_words = [w for w in V if len(w) > 15]
#sorted(long_words)

Out[43]:

['RESPONSIBILITIES',
'antiphilosophists',
'constitutionally',
'contradistinction',
'discountenancing',
'disqualification',
'enthusiastically',
'instrumentalities',
'internationality',
'irresponsibility',
'misappropriation',
'misrepresentation',
'misunderstanding',
'responsibilities',
'sentimentalizing',
'transcontinental',
'uncharitableness',
'unconstitutional']

In [ ]:

fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])#长度和词频大于7

In [46]:

#词语搭配
from nltk.util import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))#获取他们在文章中的搭配

Out[46]:

[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

In [47]:

text4.collocations()#查找text4中的搭配
text8.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

In [48]:

###其他统计结果
[len(w) for w in text1]#文中词语长度分布

Out[48]:

[1,
4,
4,
2,
6,
8,
4,
1,
。。。。。。。。。。。。。。。。。。。。。。
1,
4,
5,
5,
4,
1,
2,
...]

In [49]:

fdist = FreqDist([len(w) for w in text1])#每个长度出现的次数
fdist

Out[49]:

FreqDist({1: 47933,
2: 38513,
3: 50223,
4: 42345,
5: 26597,
6: 17111,
7: 14399,
8: 9966,
9: 6428,
10: 3528,
11: 1873,
12: 1053,
13: 567,
14: 177,
15: 70,
16: 22,
17: 12,
18: 1,
20: 1})

In [50]:

fdist.keys()#长度有哪几种

Out[50]:

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20])

In [51]:

fdist.items()#每个长度出现的次数

Out[51]:

dict_items([(1, 47933), (2, 38513), (3, 50223), (4, 42345), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)])

In [52]:

fdist.max()#某长度出现的次数最多

Out[52]:

3
推荐 0
本文由 ID王大伟 创作,采用 知识共享署名-相同方式共享 3.0 中国大陆许可协议 进行许可。
转载、引用前需联系作者,并署名作者且注明文章出处。
本站文章版权归原作者及原出处所有 。内容为作者个人观点, 并不代表本站赞同其观点和对其真实性负责。本站是一个个人学习交流的平台,并不用于任何商业目的,如果有任何问题,请及时联系我们,我们将根据著作权人的要求,立即更正或者删除有关内容。本站拥有对此声明的最终解释权。

0 个评论

要回复文章请先登录注册