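Key points
The US baby names case study uses the following methods:
- Group by sex and sum the births column: groupby("sex").births.sum()
- concat(): pass the objects to combine as a list in the first argument
- pivot_table(): build a pivot table
- Add a column such as diff directly: group['diff'] = group['M'] - group['F']
- apply(): usage with a custom function
- Inspect a DataFrame's summary information: info()
- Draw plots in different ways
- View all the data in a DataFrame, without the index and column labels: values
- Cumulative sum: cumsum()
- After normalizing, find the position of a cut-off point: searchsorted(0.5)
- Apply a function func to the name column of df: df.name.map(func)
- Normalize: df/df.sum()
- Select the distinct elements: unique()
- Convert strings to lowercase: str.lower() (the str accessor is required)
- Test whether strings contain a substring: str.contains()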
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv(r'D:\Python\datalearning\利用Python进行数据分析\pydata-book\datasets\babynames\yob1880.txt',
                   names=['name', 'sex', 'births'])
data.head()
data.groupby('sex').births.sum()
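To make the groupby step concrete without the data files, here is a minimal sketch on a made-up frame that uses the same name/sex/births layout. The values and the names toy and births_by_sex are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data in the same layout as the yob files (values are made up)
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "William"],
    "sex": ["F", "F", "M", "M"],
    "births": [70, 26, 97, 95],
})

# Split the rows by sex, then add up births within each group
births_by_sex = toy.groupby("sex").births.sum()
print(births_by_sex)
```

The result is a Series indexed by sex, so births_by_sex["F"] gives the total for girls.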
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = r'D:\Python\datalearning\利用Python进行数据分析\pydata-book\datasets\babynames\yob{}.txt'.format(year)
    frame = pd.read_csv(path, names=columns)
    # Record which year each row came from
    frame['year'] = year
    pieces.append(frame)
# Combine the yearly frames into a single DataFrame with a fresh 0..n-1 index
names = pd.concat(pieces, ignore_index=True)
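The same collect-then-concat pattern can be tried on in-memory data. This is a small sketch with invented rows (the variable names samples and combined are assumptions for illustration) showing what ignore_index=True does to the row labels.

```python
import pandas as pd

# Hypothetical per-year rows standing in for the yobYYYY.txt files
samples = {
    1880: [("Mary", "F", 70)],
    1881: [("John", "M", 90), ("Anna", "F", 20)],
}

pieces = []
for year, rows in samples.items():
    frame = pd.DataFrame(rows, columns=["name", "sex", "births"])
    frame["year"] = year
    pieces.append(frame)

# concat takes the list of frames as its first argument;
# ignore_index=True discards each frame's own 0..k labels
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Without ignore_index=True the combined frame would keep the repeated labels 0, 0, 1 from the individual pieces.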
Pivoting data with pivot_table
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
total_births.head()
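As a rough illustration of what the pivot produces, the sketch below builds a tiny table by year and sex and then adds a diff column in the way the key points describe. The data and the names toy and totals are made up for the example.

```python
import pandas as pd

# Made-up births, several rows per year/sex combination
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881, 1881],
    "sex": ["F", "M", "M", "F", "M"],
    "births": [10, 20, 5, 7, 8],
})

# Rows become years, columns become sexes, cells hold summed births
totals = toy.pivot_table("births", index="year", columns="sex", aggfunc="sum")

# A new column can be added directly from existing ones
totals["diff"] = totals["M"] - totals["F"]
print(totals)
```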
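Adding a column
def add_group(group):
    # Share of each name's births within its year/sex group
    group['prop'] = group.births / group.births.sum()
    return group
# group_keys=False keeps year and sex as ordinary columns instead of index levels (behavior may vary across pandas versions)
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_group)
# Sanity check: prop should sum to 1 within every group
names.groupby(['year', 'sex']).prop.sum().head()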
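The per-group proportion can also be computed without apply(). The sketch below swaps in groupby(...).transform("sum"), which is a different technique from the add_group function in these notes, and runs it on invented data.

```python
import pandas as pd

# Made-up rows: two names per sex in a single year
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1880],
    "sex": ["F", "F", "M", "M"],
    "births": [30, 10, 15, 5],
})

# transform("sum") returns one group total per original row,
# aligned with toy's index, so the division is element-wise
group_totals = toy.groupby(["year", "sex"]).births.transform("sum")
toy["prop"] = toy.births / group_totals

# Sanity check: proportions add up to 1 inside each group
check = toy.groupby(["year", "sex"]).prop.sum()
print(check)
```

Because transform keeps the original row order and index, this variant does not depend on how apply() handles group keys.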
# Take the 1000 most common names in each year/sex group
grouped = names.groupby(['year', 'sex'])
pieces = []
for year, group in grouped:
    pieces.append(group.sort_values(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
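For comparison, a top-N-per-group selection can be sketched with sort_values followed by groupby(...).head(n). This is an alternative to the loop above rather than the notes' own method, and it uses N = 2 on made-up data.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1880] * 5,
    "sex": ["F", "F", "F", "M", "M"],
    "name": ["Mary", "Anna", "Emma", "John", "William"],
    "births": [70, 26, 20, 97, 95],
})

# Sort once by births, keep the first 2 rows seen in each year/sex group,
# then put the result back in a readable order
top2 = (
    toy.sort_values("births", ascending=False)
       .groupby(["year", "sex"])
       .head(2)
       .sort_values(["year", "sex", "births"], ascending=[True, True, False])
       .reset_index(drop=True)
)
print(top2)
```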
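Analyzing naming trends
# Births per year with one column per name; rebuilt here (an assumption, the step is not
# shown in these notes) because the earlier by-sex table has no name columns
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.head()
subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")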
Compute the share of births covered by the 1000 most popular names, aggregated by year and sex, and plot it
table = top1000.pivot_table("prop", index='year',
                            columns='sex', aggfunc='sum')
table.plot(title='Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
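# boys is assumed to be the male rows of top1000 (its definition is not shown in these notes)
boys = top1000[top1000.sex == 'M']
df = boys[boys.year == 2010]
df.head(20)
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
# Assumed helper, reconstructed from the cumsum()/searchsorted(0.5) key points: names needed to cover a fraction q
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')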
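The cumsum() plus searchsorted(0.5) idea is easier to follow on a handful of numbers. The sketch below uses invented proportions that sum to 1; the names props, cumulative, position and names_needed are assumptions for illustration.

```python
import pandas as pd

# Made-up proportions for six names (they sum to 1)
props = pd.Series([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

# Largest first, then a running total:
# 0.30, 0.55, 0.75, 0.85, 0.95, 1.00
cumulative = props.sort_values(ascending=False).cumsum()

# searchsorted returns the 0-based position where 0.5 would be
# inserted to keep the running totals sorted
position = int(cumulative.values.searchsorted(0.5))

# Positions start at 0, so add 1 to get a count of names
names_needed = position + 1
print(names_needed)
```

Here the two most popular names already cover more than half of the births, so the count is 2. A larger count for a given year would indicate more diverse naming.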
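The last-letter revolution
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters,
                          columns=['sex', 'year'], aggfunc='sum')
# subtable is not defined in these notes; assumed here to pick three representative years
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
# Normalize each column so letter shares sum to 1 per sex and year
letter_prop = subtable / subtable.sum()
letter_prop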
fig, axes = plt.subplots(2,1,figsize=(10 ,8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',legend=False)
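The last-letter table and its normalization can be checked on a few made-up names. This sketch drops the year level to stay small, so it pivots by sex only; all data and the names toy, counts and letter_share are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Ethan", "Noah"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [30, 50, 20, 60, 30, 10],
})

# map applies the function to every name and returns a new Series
last_letters = toy.name.map(lambda x: x[-1])
last_letters.name = "last_letter"

# A Series can be passed as the pivot index in place of a column name
counts = toy.pivot_table("births", index=last_letters, columns="sex", aggfunc="sum")

# Dividing by the column sums turns counts into per-sex shares;
# letters that never occur for a sex stay NaN
letter_share = counts / counts.sum()
print(letter_share)
```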
Boy names that became girl names
all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
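The string filtering step relies on the str accessor, which applies a string method to every element of a Series. A minimal sketch with invented names (raw, all_names and lesley_like here are illustrative, not the real data):

```python
import pandas as pd

raw = pd.Series(["Leslie", "Lesley", "Wesley", "LESLEE", "John", "Leslie"])

# unique() drops the repeated "Leslie" and keeps first-seen order
all_names = pd.Series(raw.unique())

# .str is needed on each step: lower() and contains() are string
# methods, not Series methods
mask = all_names.str.lower().str.contains("lesl")
lesley_like = all_names[mask]
print(list(lesley_like))
```

Lowercasing first makes the match case-insensitive, which is why "LESLEE" is kept while "Wesley" is not.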
table = filtered.pivot_table("births", index='year',
                             columns='sex', aggfunc='sum')
# Normalize each row so the F and M shares sum to 1 within a year
table = table.div(table.sum(axis=1), axis=0)
table.tail()
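Row-wise normalization with div() is worth seeing on small numbers, since the axis arguments are easy to mix up. The figures below are made up, and shares is a name introduced only for this sketch.

```python
import pandas as pd

# Made-up births by sex for two years
table = pd.DataFrame({"F": [10, 30], "M": [90, 10]}, index=[1900, 2000])

# sum(axis=1) gives one total per row (per year);
# div(..., axis=0) matches those totals to rows by index label
shares = table.div(table.sum(axis=1), axis=0)
print(shares)
```

Each row of shares now sums to 1, so the columns read as the fraction of that year's births recorded for each sex.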