import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
读取文件
- 将表格型数据读取为DF对象的函数
- read_csv:默认分隔符是逗号
- read_table:默认分隔符是制表符('\t')
- read_excel:读取Excel文件
read_sql:读取SQL数据文件
image.png
!type example1.csv
image.png
df = pd.read_csv("example1.csv")
df
image.png
重点:自己创建列属性
print(pd.read_csv("example2.csv", header=None))
pd.read_csv("example2.csv", names=["a", "b", "c" ,"d"])
image.png
names=["a", "b", "c" ,"d", "message"]
pd.read_csv("example2.csv", names=names, index_col="message")
image.png
非固定分隔符去分隔字段的表格处理:
空白符或者其他模式来分隔字段,通过seq指定分隔字段
列名比数据行少,read_table将第一列作为DF的索引
image.png
pd.read_csv("example2.csv",skiprows=[0, 2, 4])
打开文件时:跳过索引为0,2,4的行数据
缺失值处理
- 数据不存在或者为空字符串
- 数据使用某个标记值表示
- pandas通过标记值识别,NA或NULL
result = pd.read_csv("example1.csv", sep=",")
result
image.png
pd.isnull(result)
按块读取文件
pd.options.display.max_rows = 15
result = pd.read_csv(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv", encoding="gb18030")
result.head()
result = pd.read_csv(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv",
nrows=6, encoding="gb18030")
result
chunker = pd.read_csv(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv", encoding="gb18030", chunksize=1000)
tot = pd.Series([])
for piece in chunker:
tot = tot.add(piece['得分'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
image.png
result.to_csv(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv", encoding="gb18030")
!type D:\Python\datalearning\PandasLearning\NBA_Data.csv
import sys
result.to_csv(sys.stdout, sep="|", na_rep="NULL")
result.to_csv(sys.stdout, index=False, columns=False)
dates = pd.date_range("1/1/2019", periods=8)
ts = pd.Series(np.arange(8), inde
处理分隔字符
- 存储在磁盘上的表格型数据都能用pd.read_table进行加载
- 对于任何单字符分隔符文件,直接用Python内置的csv模块
- 将文件对象传给csv.reader()
import csv
f = open(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv")
reader = csv.reader(f)
for line in reader:
print(line)
with open(r"D:\Python\datalearning\PandasLearning\NBA_Data.csv") as f:
lines = list(csv.reader(f))
header, values = line[0], line[1:]
data_dict = {h:v for h, v in zip(header,zip(*values))}
data_dict
image.png
json数据
- import json
- loads():将json格式转换成Python字典形式
- dumps():将Python对象转成json格式
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
obj
image.png
如何将json格式数据转成DF数据
- 先将json转成Python对象格式
- DF构造器中传入一个字典的列表
- result['siblings']:取出siblings键中的值
- columns=['name', 'age', "pets"] :定义列属性
image.png
data = pd.read_json(r'D:\\Python\\datalearning\\example.json')
data
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
二进制数据格式
- 使用Python内置的模块pickle序列化
- to_pickle将数据以pickle格式保存到磁盘上
- pandas内置⽀持两个⼆进制数据格式:HDF5和MessagePack。
HDF5格式
- 用PyTable\h5py\pandas可以读取
- HDF5是⼀种存储⼤规模科学数组数据
- HDF5中的HDF指的是层次型数据格式(hierarchical data format)
- 每个HDF5⽂件都含有⼀个⽂件系统式的节点结构
- 及时压缩、高效存储重复模式数据
- 分块读写数据;不是数据库
image.png
读取Microsoft Excel文件
读取Excel文件有两种方式:
- pandas的ExcelFile类
- pd.read_excel函数
- 扩展包xlrd--->XLS;openyxl--->XLSX
image.png
获取web api
- 通过requests库直接获取api
- 发送HTTP请求,获取requests库
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
data = resp.json()
data[0]["title"]
issues = pd.DataFrame(data, columns=['number', 'title',
'labels', 'state'])
issues
image.png
数据库交互
- 商业场景下数据一般不会放在Excel中
- 放在基于SQL的数据库中:MySQL、SQL server、PostgreSQL
- PYTHON内置的sqlite3驱动器
import sqlite3
query = """
create table test
(a varchar(20), b varchar(20),
c real, d integer);
"""
con = sqlite3.connect("mydata.sqlite")
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute("select * from test")
rows = cursor.fetchall()
rows
image.png
将数据库中的数据传递给DF中
- 生成上面的元组列表文件
- 列名:光标cursor的description属性中
- 属性是元组形式,取出属性的第一个值
- 传给DF的属性columns
cursor.description
(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
image.png
关于SQLAlchemy
- 流行的Python SQL工具
- pandas的read_sql函数允许从SQLAlchemy连接读取数据
- 导入:import sqlalchemy as sqla
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql("select * from test", db)