Author: Huang Tian-yuan, PhD candidate at Fudan University; current research covers text mining, social network analysis, and machine learning. I hope to share what I learn and to promote and deepen the use of R in industry.
Email: huang.tian-yuan@qq.com
Task Goals
The goals of this task are: (1) create a data frame inside the programming environment; (2) import a local csv file; (3) run the most basic descriptive analysis. Before showing any code, the data-frame structure deserves a brief explanation. A data frame is the storage form of a typical relational database: each row is a record, each column is an attribute, and together they form a table. It is the single most common data structure a data scientist must be comfortable with.
Python
To use the data-frame type in Python, load the pandas module.
```python
# load the package
import pandas as pd
```
Construct a data frame
```python
data = {'year': [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012],
        'team': ['FCBarcelona', 'FCBarcelona', 'FCBarcelona',
                 'RMadrid', 'RMadrid', 'RMadrid',
                 'ValenciaCF', 'ValenciaCF', 'ValenciaCF'],
        'wins':   [30, 28, 32, 29, 32, 26, 21, 17, 19],
        'draws':  [6, 7, 4, 5, 4, 7, 8, 10, 8],
        'losses': [2, 3, 2, 4, 2, 5, 9, 11, 11]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'draws', 'losses'])
football
```
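A side note worth adding here (a pandas behaviour, not something stated in the original text): the `columns=` argument of `pd.DataFrame` not only fixes the column order but also silently drops any dict keys that are not listed. A minimal sketch with a shortened version of the data above:

```python
import pandas as pd

data = {'year': [2010, 2011, 2012],
        'team': ['FCBarcelona', 'RMadrid', 'ValenciaCF'],
        'wins': [30, 29, 21]}

# columns= selects AND orders the columns; the unlisted 'year' key is dropped
football = pd.DataFrame(data, columns=['team', 'wins'])
print(list(football.columns))  # ['team', 'wins']
print(football.shape)          # (3, 2)
```

So if a key is misspelled in `columns=`, you get a column full of NaN rather than an error, which is worth keeping in mind.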
Read a csv file
The csv data below can be obtained from: https://github.com/DataScienceUB/introduction-datascience-python-book
```python
edu = pd.read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv',
                  na_values=':',                 # treat ':' as a missing value
                  usecols=['TIME', 'GEO', 'Value'])
edu
```
384 rows × 3 columns
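To see what `na_values=':'` actually does, here is a tiny self-contained sketch; the inline csv text is made up for illustration, mimicking the structure of educ_figdp_1_Data.csv:

```python
import io
import pandas as pd

csv_text = "TIME,GEO,Value\n2000,Austria,:\n2001,Austria,5.66\n"
df = pd.read_csv(io.StringIO(csv_text), na_values=':')

print(df['Value'].isna().sum())  # 1 -- the ':' cell was parsed as NaN
print(df['Value'].dtype)         # float64, since NaN forces a float column
```

Without `na_values`, the ':' would make the whole Value column a string (object) column, which breaks numeric summaries later.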
Basic description of the data
```python
# first few rows
edu.head()
```
```python
# last few rows
edu.tail()
```
```python
# column names
edu.columns
```
Index(['TIME', 'GEO', 'Value'], dtype='object')
```python
# row names (the index)
edu.index
```
RangeIndex(start=0, stop=384, step=1)
```python
# summary statistics
edu.describe()
```
R
R is a data-oriented language and ships with a built-in data-frame type (data.frame). Here, however, we will use dplyr from the tidyverse framework, because it is more convenient for many tasks.
```R
# load packages
pacman::p_load(tidyverse)
```
Construct a data frame
```R
year   = c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012)
team   = c('FCBarcelona', 'FCBarcelona', 'FCBarcelona',
           'RMadrid', 'RMadrid', 'RMadrid',
           'ValenciaCF', 'ValenciaCF', 'ValenciaCF')
wins   = c(30, 28, 32, 29, 32, 26, 21, 17, 19)
draws  = c(6, 7, 4, 5, 4, 7, 8, 10, 8)
losses = c(2, 3, 2, 4, 2, 5, 9, 11, 11)

football = tibble(year, team, wins, draws, losses)
football
```
Read a csv file
```R
read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv',
         na = ":") %>%          # na = ":" treats ':' as a missing value
  select(TIME, GEO, Value) -> edu
edu
```
```
Parsed with column specification:
cols(
  TIME = col_integer(),
  GEO = col_character(),
  INDIC_ED = col_character(),
  Value = col_double(),
  `Flag and Footnotes` = col_character()
)
```
Basic description of the data
```R
# first few rows
edu %>% head
```
```R
# last few rows
edu %>% tail
```
```R
# column names
edu %>% colnames
```
- 'TIME'
- 'GEO'
- 'Value'
```R
# row names
edu %>% rownames
```
- '1'
- '2'
- '3'
- … (row names continue through '384'; output truncated)
```R
# summary statistics
edu %>% summary
```
```
      TIME          GEO                Value      
 Min.   :2000   Length:384         Min.   :2.880  
 1st Qu.:2003   Class :character   1st Qu.:4.620  
 Median :2006   Mode  :character   Median :5.060  
 Mean   :2006                      Mean   :5.204  
 3rd Qu.:2008                      3rd Qu.:5.660  
 Max.   :2011                      Max.   :8.810  
                                   NA's   :23     
```
Comparative Analysis
First, in terms of packages: Python cannot use the DataFrame structure without pandas, whereas R ships with data.frame; the tibble from the tidyverse ecosystem is an enhanced data.frame that offers further conveniences.

Second, on constructing the data frame: in Python we used a dictionary (key-value pairs) in which each value is a list (note that "list" means very different things in R and Python); in R, we first built a vector for each attribute and then combined vectors of equal length into a data frame.

Third, reading the csv file is functionally similar in both, except that pd.read_csv can select the columns to read inside the function call, while in R I read the full data frame first and then selected the columns. Also note that Python represents missing values as NaN, while R uses NA.

Finally, the descriptive analyses are much the same in both languages. One noteworthy difference: row labels start at 0 in Python but at 1 in R. Python follows the machine convention of counting from 0; R matches the human intuition of labelling the first row as row 1.
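The 0-versus-1 difference is easy to verify on the pandas side; a minimal sketch with a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})

# the default RangeIndex labels rows 0, 1, 2, ...
print(df.index[0])       # 0, not 1
print(df.iloc[0]['x'])   # first row selected by position -> 10
```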
One more thing: my R code uses a style that differs from traditional R programming. For example, I use "->" to assign to the variable on the right. Many programmers may object, holding that assignment should flow from right to left; yet human thinking has always run left to right, as in "1 + 1 = 2". Combined with the one-directional pipe "%>%", this flexible style lets data scientists manipulate data in their environment more freely and fluently.